DOCUMENT RESUME

ED 437 439                                                  TM 030 608

TITLE          Peer Reviewer Guidance for Evaluating Evidence of Final
               Assessments under Title I of the Elementary and Secondary
               Education Act.
INSTITUTION    Department of Education, Washington, DC.
PUB DATE       1999-11-00
NOTE           99p.
PUB TYPE       Reports - Descriptive (141)
EDRS PRICE     MF01/PC04 Plus Postage.
DESCRIPTORS    *Academic Standards; Accountability; *Compensatory
               Education; Educational Assessment; Elementary Secondary
               Education; *Equal Education; *Evaluation Methods; *Peer
               Evaluation; *State Programs
IDENTIFIERS    *Elementary Secondary Education Act Title I



ABSTRACT 



The reauthorization of the Elementary and Secondary 
Education Act in 1994 includes explicit requirements to ensure that students 
served by its Title I are given the same opportunity to achieve high 
standards and are held to the same high expectations as all students in each 
state. By the spring of 2000, states should be prepared to submit evidence 
that final assessment systems are in place. To determine whether states have 
met Title I requirements, the U.S. Department of Education will use a peer 
review process involving experts in the fields of standards and assessments. 
This document is intended both to inform states about useful evidence and to 
guide teams of peer reviewers. The document consists of three main sections: 
(1) "General Characteristics of the Assessment System"; (2) "The Core of the 
Assessment System"; and (3) "Reporting and Using Assessment Results in 
Accountability." Subsections deal with specific aspects of these broad areas, 
such as "meeting standards of technical quality." Most subsections include 
discussions of the requirements and their intent and purpose and questions 
peer reviewers should ask. Four appendixes clarify a requirement related to 
students of limited English proficiency, discuss assessment information flow, 
and present a summary of types and sources of evidence for the reviews. (SLD) 






PEER REVIEWER GUIDANCE 

FOR EVALUATING EVIDENCE 

OF FINAL ASSESSMENTS 

UNDER TITLE I OF THE 
ELEMENTARY AND SECONDARY 
EDUCATION ACT 









Contents

Introduction

Part I. General Characteristics of the Assessment System

   A. Content, Grade Levels, and Administration
   B. Inclusion

Part II. The Core of the Assessment System

   C. Assessments Must Be Aligned to Standards
   D. Meeting Professional Standards of Technical Quality

Part III. Reporting and Using Assessment Results in Accountability

   E. Providing Individual Reports
   F. Disaggregated Reporting
   G. Development of District and School Profiles
   H. Ensuring that State Assessments Are the Primary Basis for Determining LEA and
      School Progress
   I. Include Students Who Have Attended School in the LEA for a Full Academic Year

Appendix A. Including LEP Students in State Assessments under Title I: “To the extent
            practicable”
Appendix B. Must All the Standards Be Assessed?
Appendix C. Summary of Alignment Elements and Illustrative Types and Sources of Evidence
Appendix D. Summary of Elements and Illustrative Evidence of Technical Quality



November 1999 



INTRODUCTION 



Raising academic standards for all students and measuring student performance to hold schools 
accountable for educational progress are central strategies for promoting educational excellence 
and equity in our schools. The reauthorization of the Elementary and Secondary Education Act 
in 1994 reformed federal programs to support State efforts to establish challenging standards, to 
develop aligned assessments, and to build accountability systems for districts and schools that 
are based on educational results. In particular, the Act includes explicit requirements to ensure 
that students served by Title I are given the same opportunity to achieve to high standards and 
are held to the same high expectations as all students in each State. 

Title I required States to adopt or develop challenging content and performance standards by the 
1997-98 school year. It also requires States to develop and implement assessments aligned to 
those standards and accountability systems based on student performance against those standards 
by the 2000-01 school year. Assessments must be field-tested prior to implementation. Thus, by 
the spring or summer of 2000, States should be prepared to submit evidence that their final 
assessment systems are in place. 

The purpose of this guidance is twofold: 1) to inform States what would be useful evidence to 
demonstrate that they have met Title I final assessment requirements; and 2) to guide teams of 
peer reviewers who will examine evidence submitted by States and advise the Department on 
whether a State has met Title I requirements. The intent of these requirements is to help States 
develop comprehensive assessment systems that provide accurate and valid information for 
holding districts and schools accountable for student performance against State standards. 
Although this document addresses each requirement separately, reviewers and States should 
recognize that the requirements are interrelated and that decisions about whether a State has met 
the requirements will be based on comprehensive examination of the evidence submitted. 

The Peer Review Process 

To determine whether States have met Title I assessment requirements, the U.S. Department of 
Education will use a peer review process involving experts in the fields of standards and 
assessments. The review will evaluate State assessment systems against Title I requirements 
only. In other words, reviewers will examine characteristics of State assessment systems that 
will be used to hold schools and school districts accountable under Title I. They will not assess 
compliance of State assessment systems with other Federal laws such as Title VI of the Civil 
Rights Act of 1964, Section 504 of the Rehabilitation Act of 1973, or provisions of the 
Individuals with Disabilities Education Act. The fact that an assessment system meets Title I 
assessment requirements does not necessarily mean that it complies with other laws. For 
guidance on compliance with Federal civil rights laws, States may consult with the Department 
of Education’s Office for Civil Rights. 

Furthermore, the peer review process will not directly examine a State’s assessment instruments 
or specific test items. Rather, it will examine evidence compiled and submitted by each State 
that is intended to show that its assessment system meets Title I requirements. Such evidence 
may include, but is not limited to, results from alignment studies; results from validation studies; 
written policies on including and, if appropriate, providing accommodations for students with 
disabilities and limited English proficient students; written policies on native-language testing of
LEP students; and score reports showing disaggregation of student performance data by
statutorily specified categories. Peer reviewers will advise the Department on whether a State 
assessment system meets a particular requirement based on the totality of evidence submitted. 
They will also provide constructive feedback to help States strengthen their assessment systems. 

States are invited to submit evidence of Title I compliance as soon as they have adopted or 
developed their final assessment systems. The Department will conduct peer reviews of 
submissions received by the beginning of each quarter (January 1, April 1, July 1, and October 1) 
during that quarter. 

Statutory and Regulatory Requirements for Final Assessment Systems 

Each State must adopt or develop a final assessment system aligned to State content and 
performance standards by the beginning of the 2000-01 school year, and tests must be 
administered before the end of the 2000-01 school year. Although most States are developing 
statewide assessment systems applicable to all students in the State, note that Title I assessment 
requirements also apply to States that, instead of developing a statewide system, choose to 
develop an assessment system applicable only to students served by Title I. 

Title I requires State assessment systems to have the following characteristics: 

• Assessments must be aligned with State content and performance standards, and they must 
provide coherent information about student attainment of State standards in at least math and 
reading/language arts. 

• If the State measures the performance of all children, the same assessments must be used to 
measure the performance of students served by Title I. 

• Assessments must be administered annually to students in at least one grade in each of three 
grade ranges — grades 3 through 5, grades 6 through 9, and grades 10 through 12. 

• The assessment system must provide for

   o participation in the assessments of all students in the grades being assessed;

   o reasonable adaptations and appropriate accommodations for students with diverse
     learning needs, where such adaptations or accommodations are necessary to measure the
     achievement of those students relative to State standards; and

   o inclusion of LEP students, who shall be assessed, to the extent practicable, in the
     language and form most likely to yield accurate and reliable information on what they
     know and can do to determine their mastery of skills in subjects other than English. To
     meet this requirement, States shall make every effort to use or develop linguistically
     accessible assessment measures, and they may request assistance from the Secretary if
     those measures are needed.

• The assessment system must involve multiple approaches with up-to-date measures of 
student performance, including measures that assess complex thinking skills and 
understanding of challenging content. 

• Assessments must be used for purposes for which they are valid and reliable, and they must 
meet relevant, nationally recognized, professional and technical standards for quality. A 
State may include assessment measures that do not meet these requirements as one of
multiple measures if it provides sufficient information regarding its efforts to validate the
measures and to report the results of those validation studies. 

• Assessment results must be disaggregated within each school and district by gender, major 
racial and ethnic groups, English proficiency status, migrant status, students with disabilities 
as compared to students without disabilities, and economically disadvantaged students as 
compared to students who are not economically disadvantaged. Disaggregated data must be 
included in annual school profiles. 

• The assessment system must provide individual student interpretive and descriptive reports 
that include individual scores or other information on the attainment of student performance 
standards. 
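
As an illustration of the disaggregation requirement listed above, the following minimal sketch groups results within each school by the statutory categories. The record fields, the sample data, and the use of "percent proficient" as the reported statistic are assumptions made for this example; the guidance does not prescribe a data format or a particular statistic.

```python
from collections import defaultdict

# Hypothetical student records; field names and values are illustrative only.
records = [
    {"school": "A", "gender": "F", "race_ethnicity": "Hispanic", "lep": True,
     "migrant": False, "disability": False, "econ_disadvantaged": True, "proficient": True},
    {"school": "A", "gender": "M", "race_ethnicity": "White", "lep": False,
     "migrant": False, "disability": True, "econ_disadvantaged": False, "proficient": False},
    {"school": "A", "gender": "F", "race_ethnicity": "Black", "lep": False,
     "migrant": True, "disability": False, "econ_disadvantaged": True, "proficient": False},
    {"school": "A", "gender": "M", "race_ethnicity": "White", "lep": False,
     "migrant": False, "disability": False, "econ_disadvantaged": False, "proficient": True},
]

# Categories named in the requirement (the field names themselves are assumed).
CATEGORIES = ["gender", "race_ethnicity", "lep", "migrant",
              "disability", "econ_disadvantaged"]


def disaggregate(records, categories=CATEGORIES):
    """Return {(school, category, value): (n_tested, pct_proficient)}."""
    counts = defaultdict(lambda: [0, 0])  # [students tested, students proficient]
    for r in records:
        for cat in categories:
            key = (r["school"], cat, r[cat])
            counts[key][0] += 1
            counts[key][1] += int(r["proficient"])
    return {k: (n, 100.0 * p / n) for k, (n, p) in counts.items()}


results = disaggregate(records)
print(results[("A", "gender", "F")])
print(results[("A", "disability", True)])
```

The same grouping could be run with a district field in place of the school field to produce district-level results. A real reporting system would also need rules this sketch omits, such as how to handle very small groups.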

A State may request a one-year extension from the Secretary if it finds problems during field- 
testing and submits a strategy for correcting those problems. If a State has not developed or 
adopted a standards-based assessment system that measures performance in at least math and 
reading/language arts by the 2000-01 school year, and if an extension is denied, the State must 
adopt an assessment system that meets Title I requirements, such as a system adopted by another 
State and approved by the Department, if appropriate. 

Starting in the 2000-01 school year, the statewide assessment system will be the primary means 
for determining whether schools and school districts receiving Title I funds are making adequate 
progress toward educating students to high standards. In determining the progress of schools, 
States must include scores of all students assessed who have attended the school for at least a full 
academic year. In determining the progress of school districts, States must include scores of 
students who have attended school in the district for a full academic year, even if they have 
attended multiple schools. 
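
The difference between the school-level and district-level inclusion rules described above can be sketched as follows. This is only an illustration: the `StudentYear` record and its two full-year flags are hypothetical, and how a State defines a "full academic year" is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudentYear:
    """One assessed student in one school year (hypothetical record layout)."""
    student_id: str
    district: str
    school: str                   # school where the student was assessed
    full_year_in_school: bool     # attended that school for a full academic year
    full_year_in_district: bool   # attended school in the district for a full academic year
    score: float


def scores_for_school(students, school):
    """Scores counted toward a school's progress: full-year students of that school."""
    return [s.score for s in students
            if s.school == school and s.full_year_in_school]


def scores_for_district(students, district):
    """Scores counted toward a district's progress: full-year students of the
    district, even if they attended more than one school within it."""
    return [s.score for s in students
            if s.district == district and s.full_year_in_district]


students = [
    StudentYear("s1", "D1", "Elm", True, True, 210.0),
    StudentYear("s2", "D1", "Elm", False, True, 190.0),   # changed schools within D1
    StudentYear("s3", "D1", "Oak", False, False, 205.0),  # arrived from another district
]

print(scores_for_school(students, "Elm"))
print(scores_for_district(students, "D1"))
```

In this example the student who changed schools within the district is left out of the school-level determination but counted in the district-level one.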

Because Title I makes State assessment systems central to holding schools and districts 
accountable, this document focuses on the uses of State assessment systems at the school and 
district levels. Nevertheless, peer reviewers should note that the Title I requirements listed above 
include the requirement that State assessment systems report results at the level of individual 
students. 

State and Local Roles 

Roles and responsibilities within a State assessment system are allocated at the State, district, and 
school levels. The Department’s 1997 guidance, Standards, Assessment, and Accountability, 
describes three acceptable state-local configurations for final assessment systems: 

• The state model, in which all students are assessed with a common State instrument that 
yields data for determining adequate yearly progress for all schools and school districts; 

• The mixed model, in which State assessments are supplemented by State-approved local 
assessments; and 

• The local model, in which the State uses no common instrument and instead applies uniform 
standards to approve and monitor assessment systems developed by each district. 

In implementing final assessment systems, States have two main responsibilities: 1) They must 
develop, score, and report findings from State assessments, and 2) they must promulgate rules
and procedures for local assessment systems, as well as monitor such systems, to ensure
technical quality and compliance with Title I requirements. The second function is particularly 
significant in assessment systems with strong local responsibility. Yet it remains salient even for 
States with uniform statewide assessments, since many such States employ a mixed model with 
local assessments playing some role in meeting Title I requirements. 

Format of the Guidance 

This document consists of three main sections with several subsections: 

I. General Characteristics of the Assessment System

A. Content, Grade Levels, and Administration

B. Inclusion 

II. The Core of the Assessment System 

C. Assessments Must Be Aligned to Standards 

D. Meeting Professional Standards of Technical Quality 

III. Reporting and Using Assessment Results in Accountability 

E. Providing Individual Reports 

F. Disaggregated Reporting 

G. Development of District and School Profiles 

H. Ensuring that State Assessments Are the Primary Basis for Determining LEA and 
School Progress 

I. Include Students Who Have Attended School in the LEA for a Full Academic Year 

Most subsections include five parts: 

1. Requirements: Statutory and regulatory excerpts 

2. Intent and purpose: A brief discussion of the reasoning behind the requirement 

3. Abbreviated description: A description of critical points in the requirement 

4. Full description: Detailed explanation of each point in the requirement 

5. Questions for reviewers: Questions peer reviewers will consider as they look at State 
evidence, accompanied by examples of “desirable evidence” that States might provide for 
each requirement as well as evidence likely to be considered “incomplete” or “unacceptable.” 

The document also includes four appendices. Appendix A clarifies the requirement that States 
assess LEP students “to the extent practicable” in the language and form most likely to yield 
accurate and reliable information on what these students know and can do in subjects other than 
English. Appendix B discusses the flow and possible uses of assessment information gathered at 
State and local levels. Appendix C discusses the types of evidence useful for demonstrating 
alignment between standards and assessments. Appendix D discusses the types of evidence 
useful for demonstrating technical quality. 







PART I: GENERAL CHARACTERISTICS 
Part IA. Content, Grade Levels, and Administration 

1. Requirement 



Each State plan shall demonstrate that the State has developed or adopted a set of high- 
quality, yearly student assessments, including assessments in at least mathematics and 
reading or language arts, that will be used as the primary means of determining the yearly 
performance of each local educational agency and school served under this part in 
enabling all students to meet the State’s student performance standards. Such 
assessments shall - 

> be the same assessments used to measure the performance of all children, if the State 
measures the performance of all children; 1 

> measure the proficiency of students in the academic subjects in which a State has 
adopted challenging content and student performance standards and be administered at 
some time during grades 3 through 5, grades 6 through 9, and grades 10 through 12. 

> involve multiple up-to-date measures of student performance, including measures that
assess higher order thinking skills and understanding (Sec. 1111(b)(3)(A), (D), and (E)).



2. Intent and purpose 

The intent of these requirements is to ensure that 1) Title I students are not held to lower 
standards than other students through less rigorous assessments, or through assessments that 
measure different standards; and 2) schools and districts know how well all of their students are 
doing in relation to a common set of State standards so that schools and districts can be held 
accountable and make improvements. 

The requirement for including multiple measures has several purposes: 

1) to provide more complete measurement of the content and performance standards and 
therefore increase the validity of the inferences made about school performance; 

2) to offer a variety of opportunities for schools and districts to demonstrate performance and 
therefore increase the fairness of determinations about performance; 

3) to provide a means for the State assessment to measure a range of cognitive attributes, 
including higher order thinking skills; and 

4) to ensure that the State standards are assessed comprehensively and with the same degree of 
emphasis and depth as stated in the standards. 

States may meet these requirements by implementing statewide tests, local tests that are 
approved by the State, or both. Most States are implementing State assessment systems that 
include measures at the State, district, and school levels. Describing such a system and how the
various components work together is very important for helping peer reviewers examine State
evidence in relation to the Title I requirements.

1 Questions regarding modifications and adaptations for students with diverse learning needs and linguistically
accessible assessments for limited English proficient students are addressed in Part I B, “Inclusion.”

3. Full description 

The State assessment system may consist of standards-based measures adopted or developed by 
the State, measures adopted or developed by LEAs, or both. If determination of school and 
district progress is based in whole or in part on measures adopted/developed by LEAs, the State 
must provide criteria or models for the LEA assessments. The State also is responsible for 
monitoring the quality of such assessments. 

Content 2

The law requires the annual assessment of all students served by Title I in both mathematics and 
reading or language arts, using measures that assess higher order thinking skills and 
understanding. Although the law requires only that States assess in these content areas, the 
regulations make it clear that the Secretary of Education encourages States to broaden their 
assessments to include other content areas such as science and social studies: “If a State has 
standards and assessments for all students in subjects beyond mathematics and reading/language 
arts, the regulations do not preclude a State from including, for accountability purposes, 
additional subject areas, and the Secretary encourages them to do so.” (Federal Register, July 3, 
1995, Regulations, page 34800) 

If the State assesses mathematics or reading/language arts through testing in another content 
area, the assessments must yield information about student performance in mathematics and 
reading/language arts. If a State assesses math and reading in the same grade using a matrix 
sampling approach, it must ensure that every student is assessed in both reading and math. Also, 
if a language arts assessment is used, it must measure reading and yield information about 
student performance in reading. 
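
The matrix sampling condition above (every student must be assessed in both reading and math) lends itself to a simple mechanical check. The sketch below assumes a hypothetical design in which each test form covers a known set of subjects and each student is assigned one or more forms; the form names and subject labels are invented for the example.

```python
# Hypothetical matrix-sampling design: subjects covered by each test form.
FORM_SUBJECTS = {
    "form_1": {"math", "reading"},
    "form_2": {"math", "reading", "science"},
    "form_3": {"reading", "science"},
}

# Subjects in which every student must be assessed.
REQUIRED = {"math", "reading"}


def students_missing_required(assignments, form_subjects=FORM_SUBJECTS,
                              required=REQUIRED):
    """Return {student_id: missing subjects} for students whose assigned
    forms do not cover every required subject."""
    missing = {}
    for student, forms in assignments.items():
        covered = set().union(*(form_subjects[f] for f in forms))
        gap = required - covered
        if gap:
            missing[student] = gap
    return missing


assignments = {
    "s1": ["form_1"],
    "s2": ["form_3"],            # reading and science only: math is missing
    "s3": ["form_3", "form_1"],
}

print(students_missing_required(assignments))
```

An empty result would indicate that the form assignment satisfies the condition for every student in the sample.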



Multiple Measures 3

The Title I legislation requires the use of multiple measures. This requirement has been 
interpreted in the Department’s Guidance on Standards, Assessments, and Accountability (1997) 
to mean that different approaches and formats should be included in a State assessment system. 
Examples include criterion-referenced tests, standardized norm-referenced tests, writing samples, 
completion of graphic representations, observation checklists, performance of exemplary tasks, 
performance events, and portfolios of student work. The assessments must include measurement 
of complex skills and understanding of challenging content in at least mathematics and 
reading/language arts. 

Although multiple approaches such as a criterion-referenced test and a performance task for a
single subject are not required, States should determine how multiple approaches are
appropriately included in their State assessment systems based on 1) the nature of the content
and performance standards in each content area and 2) the contribution of the measures to the
technical quality of the assessment system at the level of use of the results. For example,
depending upon the findings of the State’s alignment study, multiple measures may be necessary
in order to adequately assess the State’s standards and therefore meet the alignment requirement.
Multiple approaches may also provide more complete information on school progress.

2 The degree to which content standards in mathematics and reading/language arts are assessed and the quality of
coverage are addressed in Part II C, “Assessments Aligned to Standards.”

3 The quality and use of multiple measures is addressed in Part II C, “Assessments Aligned to Standards,” and Part
II D, “Professional Standards of Technical Quality.”

For purposes of holding districts and schools accountable for making adequate yearly progress, 
the State assessment must be the primary measure used. The State may use data derived from 
other school indicators such as attendance and graduation. However, such data may not be used 
to meet the Title I requirement for multiple measures in a statewide assessment system. 

Grade Levels 



Testing needs to occur in only one grade of each grade span. Although at least mathematics
and reading/language arts must be assessed within each grade span, different subjects may be
tested in different grades (e.g., reading in grade 3 and math in grade 4). The performance of each
school served by Title I, however, must be measured. If a school does not encompass a grade
that is within the State assessment system (e.g., a school serves grades K-2 and testing is in grade
4), then the State must develop a means for holding such schools accountable for student
performance. Such schools may either 1) use locally adopted or developed assessments or 2) use
assessment information from their receiving schools that shows student performance in math and
reading within the relevant grade spans.
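
The grade-span rule above can be expressed as a small coverage check. The sketch assumes a hypothetical testing plan that maps each subject to the set of grades in which it is tested, with kindergarten coded as grade 0. It flags any required subject with no tested grade in a span, and it identifies schools (such as K-2 schools) that serve none of the tested grades and therefore need one of the two alternatives described.

```python
# Grade spans named in the Title I requirement.
GRADE_SPANS = {"3-5": range(3, 6), "6-9": range(6, 10), "10-12": range(10, 13)}
REQUIRED_SUBJECTS = ("mathematics", "reading/language arts")


def coverage_gaps(tested_grades):
    """tested_grades maps subject -> set of grades tested (hypothetical layout).
    Return (subject, span) pairs in which no grade of the span is tested."""
    gaps = []
    for subject in REQUIRED_SUBJECTS:
        grades = tested_grades.get(subject, set())
        for span, span_grades in GRADE_SPANS.items():
            if not any(g in span_grades for g in grades):
                gaps.append((subject, span))
    return gaps


def school_needs_alternative(school_grades, tested_grades):
    """True if the school serves none of the tested grades, so its performance
    must be measured some other way (local assessments or receiving-school data)."""
    all_tested = set().union(*tested_grades.values())
    return not (set(school_grades) & all_tested)


# Example plan: math tested in grades 4, 8, and 10; reading in grades 3 and 8.
plan = {"mathematics": {4, 8, 10}, "reading/language arts": {3, 8}}

print(coverage_gaps(plan))
print(school_needs_alternative({0, 1, 2}, plan))   # a K-2 school
print(school_needs_alternative({3, 4, 5}, plan))
```

The second function is deliberately simplified: it treats a school as covered if any tested grade falls within it, without checking each subject separately.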

Administration 



If the State measures the performance of all students, these assessments must be used to measure 
the progress of students in Title I schools. The State should provide a description of its statewide 
assessment system that explains the purpose of its State assessment and how it is administered 
throughout the State. 

4. Preparing a State submission of evidence 

A State should submit a narrative description of its assessment system that explains the purpose 
of the system, the various components of the system, the subjects and grades assessed, and how 
assessment results are used. State submissions should address each of the peer review questions 
listed in the next section. 

A narrative description supported by other relevant documents would be most helpful. For 
example: 

• sample score reports would provide information about what is assessed and how content and 
performance standards are communicated; 

• assessment development materials such as test blueprints and item specifications would 
demonstrate that multiple measures are used and that a range of cognitive attributes are 
considered in the assessment system; 

• criteria or models that are provided to LEAs for adoption or development of local 
assessments would help explain how the State ensures quality in local components of its 
system; and 







• administration manuals that explain allowable accommodations for students with disabilities 
and LEP students would support a description of a State assessment system that includes all 
students. 

5. Questions for peer reviewers 



For each peer reviewer question, the entries below list desirable evidence followed by incomplete
or unacceptable evidence.

A1. Does the State have a statewide system for assessing all schools in the selected grade spans,
including Title I schools? If not, does the State at least have a system for assessing students in
Title I schools in relation to performance on State standards?

Desirable evidence: Reviewers will look for evidence showing whether the State assessment
system (or, if there is no State system, the assessment used for Title I purposes) includes all
required content areas and grade levels, which students the system covers, and how results are
used.

Incomplete or unacceptable evidence:

• The State uses a different system of assessment for some groups of students, such as Title I
students.

• The State uses a different system of assessment to measure achievement in Title I schools than
it uses to measure achievement in non-Title I schools.

• The State has no provisions for measuring achievement in Title I schools.

A2. Does the State assessment system measure the performance of students in Title I schools
using a statewide test, local assessments, or some combination? If the State assessment system
includes LEA-adopted or developed assessments, how does the State ensure the quality and rigor
of the assessments?

Desirable evidence: If local assessments are used, peer reviewers will look for evidence that the
State has a means of ensuring high quality and rigor in local assessments. Evidence of
monitoring the quality and use of local assessments might include State-provided criteria or
models for local assessments, a peer review process for local assessments, or other procedures to
monitor the quality of the assessments and their administration and use.

Incomplete or unacceptable evidence:

• LEAs choose their assessments with no oversight (either models of assessments or criteria for
selection) from the State.

• The State does not monitor the quality or rigor of local assessments.

A3. How does the State evaluate the effectiveness of schools that do not contain any of the grade
spans covered by the State assessment system (e.g., K-2 schools)?

Desirable evidence: Reviewers will look for evidence such as descriptions of LEA assessments or
a method for matching scores from receiving schools.

Incomplete or unacceptable evidence:

• The State does not collect achievement data in schools with grades outside the required grade
spans and has not developed methods for evaluating the effectiveness of Title I schools that do
not contain the required grade spans.

A4. How does the State incorporate multiple measures of student achievement?

Desirable evidence: Reviewers will look for a description of multiple approaches or instruments,
based on State content and performance standards, in the State system. Examples might include
the use of multiple approaches and formats within a single test; the use of multiple assessment
instruments; and the use of a writing test as well as a reading test. This evidence might be found
in descriptions of the assessment system, test blueprints, or item specifications. Evidence of the
use of local assessments to measure content areas or standards will also be considered by
reviewers.

Incomplete or unacceptable evidence:

• The State uses only a multiple-choice test.

A5. Are the assessments administered annually, covering the required grade spans and content
areas, incorporating the measurement of higher order thinking skills and understanding, and
yielding scores in at least mathematics and reading?

Desirable evidence: Reviewers will look for evidence such as that needed to fill in the chart
below. This evidence may come from documents such as score reports, State reports of
assessment results, test blueprints, or descriptions of the State assessment program. If the State
uses assessments in content areas other than mathematics and reading/language arts to assess
proficiency in mathematics and reading/language arts, reviewers will look for evidence of the
production of (sub)scores in at least reading and math.

Incomplete or unacceptable evidence:

• The State measures only basic skills in reading/language arts or mathematics.




Evidence of Required Assessments, by Subject and Grade Span

                                            Grade Span    Grade Span    Grade Span
                                            3-5           6-9           10-12

Administered annually                       ________      ________      ________

Mathematics, including measurement
of higher order thinking                    ________      ________      ________

Reading/language arts, including
measurement of higher order thinking        ________      ________      ________

Other subjects optional (specify)           ________      ________      ________

Scores reported in reading                  ________      ________      ________

Scores reported in math                     ________      ________      ________





Part IB. Inclusion 

1. Requirement 



The State assessments shall provide for — 

> the participation in such assessments of all students in the grades being assessed; 

> the reasonable adaptations and accommodations for students with diverse learning 
needs, necessary to measure the achievement of such students relative to State content 
standards; and 

> the inclusion of limited English proficient students who shall be assessed, to the 
extent practicable, in the language and form most likely to yield accurate and reliable 
information on what such students know and can do, to determine such students' mastery of
skills in subjects other than English.

(Sec. 1111(b)(3)(F))

> The State plan shall identify the languages other than English that are present in the 
participating student population and indicate the languages for which yearly student 
assessments are not available and are needed. The State shall make every effort to 
develop such assessments and may request assistance from the Secretary if 
linguistically accessible measures are needed. 

(Sec. 1111(b)(5)) 



2. Intent and purpose 

The purposes of these requirements are 1) to ensure that all students are held to the same high 
standards and appropriately assessed against those standards; and 2) to ensure that the indicators 
used to hold schools accountable include performance data on all students in the grades being 
assessed. States are responsible for assessing all such students relative to proficiency on the 
State’s content and performance standards in mathematics and reading/language arts. States 
must show how they use a variety of strategies to make certain that all students participate in the 
assessment system. These strategies may include appropriate accommodations, alternate 
assessments, assessments in the students' primary languages, and linguistically simplified 
assessments. 

3. Full description 

States are responsible for assessing all students in the grades being assessed. Therefore, States 
must provide means to determine the achievement of students with disabilities and limited 
English proficient students relative to the State’s content and performance standards when 
standard assessment procedures do not provide this information. This may be accomplished 
through providing appropriate accommodations in setting, scheduling, presentation, and response 
formats for the standard assessment, or through developing or adopting primary-language 
assessments or alternative assessment procedures tied to the content and performance standards. 

The critical issue to consider in this section is whether the assessment system allows for 
assessing students with disabilities and limited English proficient students against the same 
content and performance standards that apply to all students. Technical quality, including
reliability and validity, must be ensured if assessments are administered in a non-standard
manner. 

Changes to Standard Assessment Procedures

Accommodations are changes to standard assessment conditions, including changes in setting, 
scheduling, timing, presentation, and response. For best results, accommodations should be the 
same as the instructional conditions that the student normally experiences. Determining whether 
an accommodation compromises technical quality involves judging whether the accommodation 
either alters the construct assessed or changes the performance standard. 

The best way to determine whether a specific accommodation produces a valid score (i.e., 
measures the same construct as that measured by the standard version) is through empirical 
research. Because conducting such research is time-consuming and expensive, experts have
developed several rules of thumb for categorizing some common changes to standard testing 
conditions (National Center on Educational Outcomes, 1997). For example, in most cases, 
providing a separate room for a student to take the test is considered an accommodation that 
produces a valid score. Allowing the use of a calculator on an assessment designed to measure 
calculation skills is usually considered to be an accommodation that produces an invalid score. 

Students with Disabilities 



Decisions on the types of assessment accommodations or adaptations provided to a student with 
disabilities, or the decision to use an alternate assessment, should follow standard State 
guidelines that are consistent with IDEA requirements. Decisions on each student's participation 
in State and district assessment programs must be consistent with the appropriate Federal laws 
and regulations. For some students with disabilities the appropriate law is the reauthorized 1997 
Individuals with Disabilities Education Act (IDEA). Other students with disabilities who are 
evaluated and determined to be ineligible for special education and related services under IDEA 
are provided reasonable accommodations in accordance with Section 504 of the Rehabilitation 
Act of 1973 (Section 504), as amended. 

Students with Individualized Education Programs (IEP) under IDEA: This Federal law and
its accompanying regulations require that students with disabilities be included in State and
district-wide assessment programs, with appropriate accommodations and modifications, if
necessary.

IDEA also recognizes that some students with disabilities may be unable to participate in general 
State or district assessments, even with the use of appropriate accommodations. States must 
ensure that guidelines are developed for the participation of students with disabilities in alternate 
assessments for those students who cannot participate in the general assessment program. (IDEA 
requires that states develop alternate assessments by July 1, 2000.) 

A student's participation in alternate assessments or any individual assessment accommodations
or modifications needed by the student must be determined and documented in his or her IEP, by
the student's IEP team. If a student's IEP team determines that he or she will not participate in a
particular State or district assessment of student achievement (or part of an assessment), the
student's IEP must include a statement of why that assessment is not appropriate for the student
and how the student will be assessed. If IEP teams properly make individualized decisions about
the participation of each child with a disability in State and district-wide assessments (including
the use of appropriate accommodations and modifications in the administration, as appropriate),
it should be necessary to use alternate assessments for only a relatively small percentage of
children with disabilities (Federal Register, Vol. 64, No. 48, March 12, 1999, p. 12564).

4 The terminology used for assessment alterations is confusing. The terms accommodations, modifications,
adaptations, and alterations are sometimes used to mean the same thing, and sometimes used to mean different
things. Because these terms are not used with uniform, consistent meaning, we only use the term “accommodation”
here, with adjectives added to clarify whether an accommodation results in a valid score or invalid score.

Students with 504 Plans or IEPs not Under IDEA: The student who meets the Section 504 
definition of disability must be provided reasonable accommodations, which can include special 
education and related services if determined appropriate by the Section 504 placement team. 
Decisions on the types of assessment accommodations or adaptations provided to a student 
under Section 504 should be documented in the student's IEP (if the parent and school placement 
team have agreed upon the IEP option) or 504 Plan, and they should be closely related to 
procedures used in the student's instruction. 

Alternate assessments are used when appropriate accommodations cannot provide students with
an opportunity to demonstrate their knowledge and skills. Typically, alternate assessments 
incorporate more fundamental changes to testing conditions, such as using an entirely different 
format for the assessment (e.g., a portfolio system instead of on-demand tests) or assessing 
content standards that are changed in some way (e.g., expanded standards with different 
performance descriptors). 

Students with Limited English Proficiency 

All LEP students in the grades being assessed must be a part of the State’s assessment system. 
One of the pieces of the statute that is challenging to interpret, however, is what it means to
assess LEP students “to the extent practicable, in the language and form most likely to yield
accurate and reliable information.” The Department has developed a series of questions that a State or
district should consider when determining the “extent practicable” for offering assessments in
other languages and forms for LEP students. Appendix A includes this guidance, and peer
reviewers should consider how the State explains its system against this framework.

States must identify the languages other than English that are present in their student population 
and the levels of English proficiency among their LEP students, and use this information to 
determine if assessments written in languages other than English are needed. If a State has a 
large population of LEP students who speak a single non-English language, it may be feasible 
and appropriate to provide assessments aligned with standards in that language. In most
States, the population of Spanish-speaking students is large enough to justify the development of 
Spanish versions of the assessments. If several different languages are spoken by LEP students 
and no single language constitutes a significant concentration, it may not be feasible to assess in 
students’ native languages, but appropriate accommodations should be offered that reflect the 
instructional approaches those students are experiencing. 

At the student level, the decision of how to best assess an LEP student should be based on 
several factors, including level of English proficiency, primary language of instruction, level of 
literacy in the native language, and number of years the student has received academic 
instruction in English. The appropriate form of assessment might be assessing the student orally 
or in writing in his or her native language; providing accommodations such as a bilingual 
dictionary, extra time, or simplified directions; using an assessment that has been stripped of 
non-essential language complexity; or administering an English language assessment orally. 

Exemptions 

Title I does not permit States to exempt any student subgroup from their final assessment 
systems, though individual exemptions may be permitted by the State in extraordinary 
circumstances such as medical emergency or parental insistence. If the State exempts any 
students from the assessments, it must describe the exemption criteria and process. The number 
of exemptions from the assessment should be minimal and should be based upon reasonable 
criteria. Furthermore, the State should explain the procedures followed in documenting which 
students are not assessed, including auditing and record-keeping activities. The State should 
explain how it plans to reduce the number of exemptions and how it will verify that policies 
designed to increase student participation in the assessment system produce the intended effects. 

In the case of children with disabilities, the final regulations implementing IDEA require that the 
IEP team have the responsibility and the authority to determine what, if any, individual 
accommodations or modifications in the administration of state assessments are needed in order 
for a particular child with a disability to participate in the assessment. Likewise, it is the IEP 
team that must determine whether a child will not participate in a particular state assessment of 
student achievement (or part of an assessment) and if not, how that child will be assessed. 

Technical Considerations 



When assessment procedures are altered, it is critical to ensure that scores, decisions, and 
judgments based on these assessments are fair, reliable, and valid. The criteria for technical 
quality outlined in Part II E, "Professional Standards of Technical Quality," apply to modified, 
accommodated, and alternate assessments. The issue of fairness is one that is particularly salient 
in the area of assessment accommodations. Accommodations should provide students with the 
same opportunity to demonstrate their knowledge and skills as students who do not need an 
accommodation. For example, if language-related accommodations provide students such an 
opportunity, an LEP student with the same level of knowledge and skills in mathematics as a 
non-LEP student will achieve the same proficiency level on the assessment. 

4. Preparing a State submission of evidence 

States should submit three types of documentation as evidence that they have met these 
requirements. First, they should describe their analysis of the diverse learning needs of the entire 
student population (including contextual information such as the number of students with 
disabilities, the number of students needing accommodations, the languages spoken by students 
in the State, and the number of students in each language category) and the implications for 
making assessments accessible. Second, they should document policies and strategies that they 
have implemented to ensure that all students are part of the assessment system. Third, they 
should submit data on the percentage of students participating in the assessments, explanations 
for those who were not included, and strategies for including them in the future. 
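
The participation data described above can be summarized with simple arithmetic. The following is a minimal sketch, not a prescribed reporting method: the class name, field names, and the sample counts are hypothetical, and it assumes a State records, for each student group in a grade, the total enrolled, the number assessed, and the number whose scores are included in measures of progress.

```python
from dataclasses import dataclass


@dataclass
class GroupCounts:
    """Counts for one student group in one grade (hypothetical field names)."""
    total: int                 # students in the group in the grade being assessed
    assessed: int              # students included in the assessment
    in_progress_measures: int  # students whose scores count in measures of progress


def participation_summary(counts: GroupCounts) -> dict:
    """Return participation percentages and the number of students not assessed."""
    if counts.total <= 0:
        raise ValueError("total must be a positive number of students")
    if not (0 <= counts.assessed <= counts.total):
        raise ValueError("assessed must be between 0 and total")
    if not (0 <= counts.in_progress_measures <= counts.total):
        raise ValueError("in_progress_measures must be between 0 and total")
    return {
        "pct_assessed": round(100 * counts.assessed / counts.total, 1),
        "pct_in_progress_measures": round(
            100 * counts.in_progress_measures / counts.total, 1
        ),
        "excluded_from_assessment": counts.total - counts.assessed,
    }


# Illustrative counts only; these are not data from any State.
example = participation_summary(
    GroupCounts(total=400, assessed=380, in_progress_measures=360)
)
print(example)
```

A gap between the two percentages would indicate students who were assessed but whose results are left out of measures of school and district progress, which is one of the conditions reviewers are asked to look for.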






5. Questions for reviewers 



Peer reviewer question B1. Do the State data on assessment participation rates indicate that
virtually all students are included in the assessment and that their scores are used to evaluate
school and district progress?

Desirable Evidence: Reviewers will look for information describing the State's students with
disabilities and limited English proficient students and their rates of participation in the State
assessment system, as illustrated in the chart below. Peer reviewers will look for substantial
evidence that all students are assessed and their performance is reported. They will look for
effective strategies that are being developed and implemented to include any students who have
been excluded to date. Evidence also might include descriptions of State policies that provide
incentives for including and sanctions for excluding students with disabilities and limited English
proficient students from the assessment.

Incomplete or Unacceptable Evidence:

• A large number of students with disabilities are excluded from the assessment system, and the
State does not have plans for initiating procedures for including these students.

• Although students with disabilities and limited English proficient students are included in the
system, results of their assessments are excluded from measures of school and district progress.

• The State has adopted policies permitting categorical exemption of students with disabilities
and LEP students from the statewide assessment system.



Information to Determine the Need for Test in Language(s) Other Than English

Primary Languages in Grade ____ | Number of Limited English Proficient Students
Language 1 |
Language 2 |
... etc. |

Participation Information for Grade ____

General | Number
Total student population |
Total students with disabilities (IEP & 504) |
Total limited English proficient students |

Participation | Included in Assessment | Included in Measures of Progress
Number of students with disabilities included in State assessment without appropriate accommodations | |
Number of students with disabilities included in State assessment with appropriate accommodations | |
Number of students with disabilities tested with other State standards-based assessments (beyond accommodations; e.g., alternate assessment) | |
Number of limited English proficient students included in State assessment without appropriate accommodations | |
Number of limited English proficient students included in State assessment with appropriate accommodations | |
Number of limited English proficient students tested with other State standards-based assessments (beyond accommodations; e.g., alternative, non-parallel test) | |

Exemptions and Exclusions | From Assessment | From Measures of Progress
Number of students with disabilities excluded | |
Number of limited English proficient students excluded | |










Peer reviewer question B2. What policies does the State have for including students with
disabilities in its assessment system? Does the State policy result in participation rates that
provide meaningful data on how well students with disabilities are performing relative to State
standards? What policies are provided regarding appropriate accommodations for students with
disabilities and the use of alternate assessments?

Desirable Evidence: Peer reviewers will look for substantial evidence that the State has
considered the needs of students with disabilities in both the development and implementation
phases of its assessment; has effective policies in place for using appropriate accommodations;
and has policies to ensure that IEP teams are involved in determining assessment
accommodations and whether an alternate assessment is necessary.

In addition to counts of participation illustrated in B1, evidence might include State guidelines
for appropriate accommodations, handbooks for IEP teams, and other policy documents.

Incomplete or Unacceptable Evidence:

• No accommodations are offered for students with disabilities.

• The State does not have policies for assessing students with disabilities who are excluded from
the State assessment(s).

Peer reviewer question B3. Does the State have a policy in place for maximizing the inclusion of
LEP students in the statewide assessment? Does the State policy result in participation rates that
provide meaningful data on how well LEP students are performing relative to State standards?
What policies are provided regarding appropriate accommodations and linguistically accessible
assessments for LEP students?

Desirable Evidence: Reviewers will look for evidence that the State has conducted an analysis of
its LEP student population and what their learning needs are, including the use of measures of
language proficiency; developed strategies to ensure that they are tested appropriately; and
implemented statewide policies or guidelines for appropriate accommodations for LEP students.

Ideally, peer reviewers would like to see evidence that the State considered the needs of LEP
students in the development and design of the State assessment so that it would provide valid
and reliable results even with accommodations. If this has not occurred, then strategies to provide
appropriate accommodations and multiple measures must be clearly described.

Incomplete or Unacceptable Evidence:

• State policies allow the exclusion of LEP students from participating in the State assessment
and measures of program progress.

• The State has not investigated approaches for providing linguistically accessible assessments
for LEP students.

• The State has no procedures for assessing excluded LEP students' achievement in relation to
State content and performance standards.

• The State offers only an assessment in English without accommodations to students who have
recently arrived in the U.S. and are not proficient in English.





Peer reviewer question B4. Does the State offer native language assessments for some LEP
populations? Are policies in place to ensure that they are used appropriately? If not, why not? Is
it practicable to offer these in the future? Does the State require that staff conducting native
language assessment possess adequate proficiency in the native language? Are they adequately
prepared and trained in the assessment procedure?

Desirable Evidence: Peer reviewers will look for evidence that the State is at least making a
Spanish language version of the test available (since over 70% of LEP students are Spanish
speakers), unless the State has a very small Spanish-speaking population. Reviewers will look for
evidence that the State has additional strategies for adopting or developing native language
assessments where appropriate.

Incomplete or Unacceptable Evidence:

• The State does not offer native language assessments and has not shown that provision of
linguistically accessible assessments to LEP students would be impracticable.

Peer reviewer question B5. Do accommodations offered to students with disabilities and LEP
students reflect the instructional approaches used with those students?

Desirable Evidence: Reviewers will look for evidence of standard procedures and guidelines for
determining which accommodations are appropriate for individual students. Evidence should
show how accommodations reflect the ways students learn content.

Incomplete or Unacceptable Evidence:

• Students are provided accommodations that are unrelated to instructional approaches routinely
used in their instruction (e.g., an audiotaped assessment administration when a student routinely
reads print material).

Peer reviewer question B6. Do the accommodations offered to students with disabilities and LEP
students provide a means for making valid inferences about the knowledge and skills of these
students? Has the State investigated the technical quality of the accommodated scores?

Desirable Evidence: Reviewers will look for evidence that appropriate accommodations have
been selected or developed in such a manner that valid inferences can be made about student
proficiency in relation to State standards. Such evidence might be found in descriptions of expert
review of and recommendations for appropriate accommodations and studies conducted on the
effects of appropriate accommodations on student scores.

Incomplete or Unacceptable Evidence:

• The State has no rationale, either judgmental or empirical, for the accommodations it offers and
the accommodations it prohibits.





Peer reviewer question B7. Does the State monitor the application of inclusion policies at the
local level?

Desirable Evidence: Peer reviewers will look for evidence that the State has a means of ensuring
that inclusion and accommodation policies are applied consistently and appropriately across the
State.

Evidence might include descriptions of training in using inclusion and accommodation
guidelines, a description of monitoring procedures, or a report of the results of monitoring
application of State guidelines.

Incomplete or Unacceptable Evidence:

• The State does not provide information on how to apply its guidelines and does not monitor
how closely LEAs follow the guidelines.







Examples of Sources of Evidence for Part I 



Types of evidence States might use to document progress pertaining to General Characteristics 
of the Assessment System are described below. This list does not include all possible types of 
evidence; rather, it is designed to serve as a source of ideas for States as they prepare their 
evidence for review. 



Sources of Evidence | Content, grade levels, & administration | Inclusion
Description of assessment system, including subject areas, grades assessed, and frequency of administration | X | X
Sample score reports | X | X
Assessment development materials (e.g., assessment blueprints, item specifications) | X | X
Criteria/models for LEA adoption/development of assessments | X | X
Description of assessments or tracking procedures for schools without grades covered by the State assessment | X | X
Administration manual that includes descriptions of allowable accommodations for students with disabilities and guidelines for the inclusion or exemption of limited English proficient students | X | X
Description of procedures for determining whether students with disabilities (both IEP and 504) and limited English proficient students are provided with appropriate accommodations/translations or exempted from the general assessment | | X
Description of development of alternate assessment procedures for students with disabilities | | X
Description of development of linguistically accessible assessments | | X
Description of instruments used to assess language proficiency of limited English proficient students | | X
Documentation of number of exemptions from State assessments | | X






PART II - THE CORE OF THE ASSESSMENT SYSTEM 

PART II - C: Assessments Must be Aligned to Standards 

1. Requirement-Legal Citation 



The State assessment shall - 

Be aligned with the State's challenging content and student performance standards and 
provide coherent information about student attainment of such standards. 

(Sec. 1111(b)(3)(B))



2. Intent and purpose 

The intent of this requirement is to ensure that assessments reflect what students are expected to 
know and be able to do. Such assessments will help guide educators in measuring student 
progress and making necessary alterations in their teaching and learning strategies to help all 
students master challenging State standards. For the purposes of this review, alignment is 
defined as the degree to which assessments provide valid and accurate information about the 
performance of all students in an academic content area at the desired level of detail on the 
State's content standards. 

3. Abbreviated Description 

Demonstrating that an assessment system is aligned to State content and student performance 
standards requires more than simply determining whether all the items on the assessment can be 
matched to one or more standards; the converse must also be probed. In other words, States 
should also determine whether the State assessment adequately measures the State’s standards. 
This can be accomplished by analyzing how well the State assessment measures State standards 
along the following dimensions: 

• Comprehensiveness: Does the assessment reflect the full range of the standards? If not, 
does it sample enough to make relevant inferences about student performance on the entire 
set of standards? Is it complemented by other measures, such as another test or local
measures that provide information to educators on the other standards? 

• Emphasis: Does the assessment reflect the same degree of emphasis on the different content 
standards as reflected in the standards documents? 

• Depth: Does the assessment reflect the cognitive depth of the standards? In other words, is 
the assessment as cognitively demanding as the standards? 

• Match with performance standards: Does the assessment provide scores that reflect the 
meaning of the different performance standards? 

• Clarity for users: Is the alignment between the standards and the assessment clear to all 
members of the school community? 






4. Full Description 



Alignment has many meanings in education. In one sense, it is the core idea underlying 
standards-based school reform; reform cannot happen unless all parts of the system come 
together, not just standards and assessments but virtually all the other components of an
educational system including teaching strategies, instructional materials, and professional 
development. 

Focusing on its systemic aspects, Webb defines alignment as the degree to which expectations 
and assessments are in agreement and serve in conjunction with one another to guide the system 
in ensuring that students learn what they are expected to know and do (Webb, 1997). More 
specific to the operational review process outlined in this document, we will define alignment as 
the degree to which assessments report valid and accurate information about the performance of 
all students in an academic content area at the desired level of detail on the State's content 
standards. 

Each State must present evidence that its assessment system is aligned to its standards. This
general statement means many specific things, and it means different things for different States, 
depending on the design of their State systems. It means something different to States that 
custom-developed their assessments to match the standards, in contrast to those States that 
adopted an assessment based on an alignment study. However, in both cases it includes the dual 
and sometimes overlapping processes of obtaining alignment, as well as verifying it. In fact, in 
some cases it includes the process of re-verification, if changes in tests were made to improve 
alignment. 

Since this document is for peer reviewers, it only provides glimpses of procedures that States can 
use to achieve or verify alignment. Fortunately, one of the CCSSO's State Collaboratives on 
Assessment and Student Standards (SCASS) has developed a document that describes and 
illustrates several approaches to alignment and alignment verification (LaMarca, Redfield and 
Winter 1999). A CCSSO project led by Blank and Webb also developed rating procedures for 
examining alignment, and applied them to the standards and assessments from four States (Blank 
and Webb, 1999). 

It seems obvious that alignment is a two-way process, especially for States that choose to select 
an existing assessment. It is not sufficient that a State determines that all the items on the 
assessment can be matched to one or more standards; the converse must also be probed, "Are all 
the standards adequately assessed?" The following visual may help to illustrate this basic point. 







Figure 1. The Two-Way Nature of the Standards and Assessment Alignment Process

[Figure: two panels labeled "Standards" and "Assessment System," each showing a numbered list of placeholder text.]



Another way to make this point is with a Venn diagram. Figure 2 illustrates, hypothetically, 
what might have happened when a given State compared its standards to a given assessment. 




Figure 2. A Visual Display of Partial Alignment Between State 
Standards and State Assessment 

In the Venn diagram above, the assessment almost completely matches the standards; that is,
only a small portion of the assessment 1 relates to knowledge and skills that are not referred to in
the standards. But, going the other direction, the story is very different: about half of the
standards are not assessed.

1 The shaded circle is meant to represent all aspects of the assessment, for example, both a norm-referenced test and
State-developed writing assessment. It might also include certain local assessments, pursuant to the following
discussion.
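
The two directions of this comparison can be expressed as set operations. The sketch below is illustrative only: the function name, the item and standard labels, and the sample mapping are hypothetical, and it assumes a State has already recorded which standards each assessment item addresses.

```python
def two_way_alignment(item_to_standards: dict[str, set[str]],
                      standards: set[str]) -> dict:
    """Compare items and standards in both directions.

    item_to_standards maps each item label to the set of standards it addresses
    (an empty set means the item matches no standard).
    """
    if not item_to_standards or not standards:
        raise ValueError("need at least one item and one standard")

    # Direction 1: which items match at least one State standard?
    matched_items = {
        item for item, addressed in item_to_standards.items()
        if addressed & standards
    }
    # Direction 2: which standards are addressed by at least one item?
    assessed_standards = set().union(*item_to_standards.values()) & standards

    return {
        "share_of_items_matched": len(matched_items) / len(item_to_standards),
        "share_of_standards_assessed": len(assessed_standards) / len(standards),
        "unassessed_standards": sorted(standards - assessed_standards),
    }


# Hypothetical mapping: most items match a standard, but half the standards
# are never assessed, which is the pattern described for Figure 2.
result = two_way_alignment(
    {"item1": {"S1"}, "item2": {"S1", "S2"}, "item3": {"S2"}, "item4": set()},
    {"S1", "S2", "S3", "S4"},
)
print(result)
```

A high share in the first direction combined with a low share in the second is the situation the Venn diagram describes: the items fit the standards, yet many standards go unmeasured.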

Facets of Alignment 

To satisfy the definition given above, a State's assessment system must 

a) reflect the full range (comprehensiveness) of the standards; 

b) reflect the relative emphases on the different content standards; 

c) reflect the (cognitive) depth of the standards; 

d) provide scores that reflect the meaning of the different performance standards; and 

e) make it clear and transparent to all members of the school community how the 

standards and assessments are aligned. 

The following paragraphs further describe these five elements. 

Content Alignment-Comprehensiveness

Comprehensiveness implies that all standards are to be assessed. This idea of 
comprehensiveness is addressed in Webb's framework as the criteria of "categorical 
concurrence" and "range of knowledge correspondence." It means that standards and 
assessments cover a comparable span of topics and ideas within categories, and do so at the 
specified level of detail (Webb, 1997). In the face of challenging content standards, it is 
unlikely that a single assessment instrument will provide the simultaneous breadth and depth 
necessary for a fully aligned system. 

The law requires only that the assessments 

• be aligned with the State's challenging content and student performance standards; 

• include the same knowledge, skills, and levels of performance expected of all 
children; and 

• measure performance in at least mathematics and reading/language arts. 

This leaves a great deal of flexibility to the States. A State must decide whether all standards
will be assessed at all grade levels or only at selected benchmark grades; whether all
standards will be covered in a single assessment administered at one point in time; whether
all standards will be assessed in equal depth, since some standards may be more complex or
more important than others; and whether all standards will be addressed by the State
assessment or whether the assessment of selected standards will be left to its LEAs (either all
the standards for some content areas, or certain strands within the content areas of
reading/language arts and mathematics).

Determining how all of a State’s standards should be assessed depends upon the structure of 
the standards. Some States have broad standards, fewer than 10 per content area in any grade 
level. Other States have more detailed standards, in some cases over 30 per content area. In 
most cases, the process of translating the standards into assessment blueprints will lead 
naturally to decisions about the relative depth and breadth of sampling of each standard. For 
a State with broadly defined content standards, the assessment will probably require several 
items matched to each standard. For a State with detailed content standards, sampling within 
chunks of content may provide data that can be used to make inferences at the desired level. 






States may choose to— 

• assess certain standards only at certain grade levels, provided that the State has a credible 
rationale for doing so; 

• assess certain content areas at the State level and others at the local level, provided that the 
identification of schools for Title I program improvement is based, at a minimum, on the 
content areas of reading/language arts and mathematics; 

• assess selected standards within the content areas of reading/language arts and mathematics 
as part of its State assessment, and allocate the remaining standards within reading/language 
arts and math to the LEAs for assessment. If the components or content strands that are 
assessed at the local level are included in the State's definition of adequate yearly progress, 
the State must monitor the local assessments to assure objectivity, accuracy, and 
comparability. 

When documenting the comprehensive aspects of alignment between standards and the State 
assessment system, the State should describe— 

• the relationships between the structure of the standards and the structure of the 
assessments; 

• the rationale for the overall alignment strategy, including a rationale for any standards
either not assessed or not reported as part of the State assessment;

• the manner in which each standard is assessed, whether at the State, district, school,
or classroom level;

• the type of information the State collects pertaining to each standard; and

• how the State monitors the quality of the assessment data collected at the local 
level, for all assessments that are part of the statewide Title I system. 

Appendix B describes the decision-making process required to develop an assessment system 
that addresses all standards. It includes consideration of the purposes and uses of the assessment, 
the jurisdictional level at which the assessment is conducted, and the relative appropriateness of 
assessing a given standard in a formal manner on a statewide basis. 

b. Content Alignment-Emphasis.

An aligned assessment will cover the knowledge and skills specified by the content standards 
with the same degree of emphasis as specified or implied by the standards. This is 
essentially a matter of weighting, a matter of making sure that standards that are judged more 
important than others get more weight in the computation of an overall score (whether at the 
school level or the student level). The most straightforward indicator of emphasis is the 
number of test questions per standard or subset of standards. It is important that the relative 
emphases be obvious to teachers if the assessment is going to support the aims of the 
standards. 

This use of number or proportion of items per standard is related to the use of different types 
of assessment formats, partially because different types of assessment exercises take different 
amounts of time. Performance items typically take more time, but also yield more 
information (both in a pedagogical sense and a statistical sense). The amount of time 
devoted to assessing different standards can also signal relative importance. States will need 
to work out a balance of these different factors. 
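
As a rough illustration of emphasis as a matter of weighting, the sketch below compares the share of score points given to each standard with the weight the standards are meant to carry. The intended weights, the point counts, and the function name are assumptions made up for the example; a State would substitute the figures from its own blueprint.

```python
# Hypothetical intended weights (they sum to 1.0) and score points per standard.
intended_weights = {"A": 0.5, "B": 0.3, "C": 0.2}
points_per_standard = {"A": 8, "B": 8, "C": 4}


def emphasis_gaps(intended, points):
    """Return, for each standard, (actual share of points) minus (intended weight).
    Negative values mean the standard is under-weighted on the assessment."""
    total = sum(points.values())
    return {
        std: round(points.get(std, 0) / total - weight, 3)
        for std, weight in intended.items()
    }


gaps_by_standard = emphasis_gaps(intended_weights, points_per_standard)
print(gaps_by_standard)
```

Score points are used here rather than a raw item count so that a performance task worth several points is not treated the same as a single multiple-choice question; testing time per standard could be compared in the same way.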






c. Content Alignment-Depth 

It may seem that if alignment is alignment, a true match of standards and assessments would 
automatically ensure a match on other criteria such as emphasis and depth. In the real and 
slightly messy world of alignment, however, it is important to verify that the assessments 
reflect the degree of cognitive complexity and level of difficulty of the concepts and 
processes described in the standards. Webb puts it this way: "...what is elicited from the
students on the assessments is as demanding cognitively as what students are expected to
know and do as stated in the standards" (Webb, 1999, p. 7). The meaning of "cognitively
demanding" is broad, including how well students should be able to transfer their knowledge 
to different contexts and how much prerequisite knowledge they must have in order to grasp 
more sophisticated ideas. Moreover, the law calls for the assessment of complex skills and 
understanding. 

LaMarca, Redfield, and Winter describe feasible strategies to document the alignment of 
tests to content standards. At the very least, a State can study the nature of the verbs used in 
the standards and look for their manifestations in the assessment. A State might begin by 
categorizing the complexity and set of cognitive demands specified or implied by each 
standard, then develop a set of criteria for review. 
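
One possible way to organize such a review is sketched below. It assumes, purely for illustration, that reviewers have rated each standard and each item on the same ordinal complexity scale (1 = lowest); the scale, the codes, and the ratings are hypothetical. The sketch flags standards for which no item reaches the complexity the standard calls for.

```python
# Hypothetical reviewer judgments on a shared ordinal scale (1 = lowest complexity).
standard_level = {"R.1": 1, "R.2": 3, "R.3": 2}

# (item, standard it was matched to, judged complexity of the item)
item_ratings = [
    ("item01", "R.1", 1),
    ("item02", "R.2", 2),
    ("item03", "R.2", 1),
    ("item04", "R.3", 2),
]


def depth_shortfalls(standard_level, item_ratings):
    """Return the standards whose most demanding item is still rated
    below the complexity level assigned to the standard itself."""
    highest = {}
    for _item, std, level in item_ratings:
        highest[std] = max(highest.get(std, 0), level)
    return sorted(
        std for std, level in standard_level.items()
        if highest.get(std, 0) < level
    )


shortfalls = depth_shortfalls(standard_level, item_ratings)
print(shortfalls)
```

A stricter criterion, such as requiring a minimum share of items at or above the standard's level, could be substituted; the choice of criterion is itself something the review panel would need to justify.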

d. Alignment to Performance Standards.

An aligned assessment reflects the nature of the student performance described in the 
performance standards, as well as the content standards. The Council of Chief State School
Officers and the US Department of Education recently produced a handbook on performance
standards, Handbook for the development of performance standards: Meeting the
requirements of Title I (Hansche, 1998), that sets forth the essential characteristics of
performance standards as a key component of a standards-based assessment system.
Performance standards describe the level(s) of acceptable performance, specify those levels 
in operational assessment terms, and provide a mechanism for reporting the results in terms 
of the proportion of students who meet the standards. Key elements include— 

• performance descriptors — narrative descriptions of performance at each level; and 

• exemplars — examples of student work from a representative sample of all students 
that illustrate the full range of performance at a level. 

The implications are obvious: the content of the assessment must match the knowledge and 
skills described in the performance descriptors for each performance level. Furthermore, any 
tasks and student work used to illustrate the meaning of the descriptors must reflect the actual 
tasks used in the assessment. 

e. Clarity and Transparency of the Alignment. 

The alignment between standards and assessments needs to be reflected in various documents 
available to teachers, students, and parents. These users ought to be able to see easily how 
the meaning and the relative weight of the different standards are reflected in the 
assessments. Assessment reports can help to communicate this alignment, but it is likely that 
other documents will be needed also. 






Conducting the Alignment Process 

Webb (1999) suggests that both content experts and people knowledgeable about a State’s 
standards and assessments serve on review panels as part of the alignment and alignment 
verification process. These reviewers need training in the review process and should be 
monitored periodically throughout the process to ensure that they are applying the review criteria 
appropriately. 

5. Preparing a State submission of evidence 

States will be expected to describe how they are addressing each of the five aspects of alignment 
described above. The State should provide evidence that it has studied the alignment of the 
assessment and standards and, if gaps exist, that it has identified additional measures to 
adequately assess the standards. In some cases, the State may need to focus more on its plans 
than its progress in addressing these facets of alignment. Peer reviewers would then consider 
these plans as well as the documentation of what the State has already accomplished. There is no 
single best way of accomplishing the alignment or documenting the process, but it is reasonable 
to expect that a broad variety of stakeholders will be involved in the process, and that the 
assessment blueprints or specifications play a key role. 

In States that develop their assessments to fit the standards, the reviewers might expect an 
independent post hoc review to confirm successful alignment. Other assessment development 
strategies might call for two alignment reviews done at different times. The first would identify 
the relative alignment of different ready-made assessment packages, and the second would 
confirm the process after the problem of any serious gaps in assessing the standards had been 
addressed. The bottom line is that the responsibility for alignment rests with the State, regardless 
of the State-local configuration or of the assessment development strategy selected. 

6. Questions for reviewers: 



C1. Peer reviewer questions: What is the State's approach to ensuring alignment of its standards
and assessment? What kinds of alignment studies have been done? Who was involved? What
methodology was used? What were the findings?

Desirable Evidence: Reviewers will look for a description of the State's approach to ensuring
alignment. They will evaluate whether the approach is reasonable and thoughtful. They will be
looking for evidence that the State is taking a coherent approach to ensuring that its tests reflect
what the State has determined students need to know and do. This almost surely will involve
some type of alignment study.

Incomplete or Unacceptable Evidence:

• A checklist showing that all of the assessment items match one or more standards

• A study that did not involve content experts, that examined the alignment only at a very
global level, or that failed to ensure objectivity in the process

C2. Peer reviewer questions: How is the State ensuring that its assessment system reflects its
content and performance standards in terms of comprehensiveness and emphasis?

Desirable Evidence: Reviewers will look for evidence from the assessment plan, the assessment
blueprints and/or item/task specifications that the State considered how all content standards
would be assessed or how domain sampling would lead to valid inferences about student
performance on the standards. They will look for descriptions and evidence that (a) the full
scope of the standards and their differential emphases are reflected in the blueprints and that
(b) the assessments match the blueprints. They will expect to see that impartial experts were
involved in the process.

Incomplete or Unacceptable Evidence:

• An assertion of comprehensiveness without documentation matching both assessments to
standards and standards to assessments.

C3. Peer reviewer questions: How is the State ensuring that its assessment reflects its content
and performance standards in terms of depth and match with performance standards? How is
the State ensuring that its assessment covers the range of cognitive complexity of its standards,
not just the basic skills? How is the State ensuring that the assessments actually reflect the types
of student performance called for in performance standards?

Desirable Evidence: Reviewers will look for a description and evidence that cognitively complex
standards are adequately assessed. As in comprehensiveness and emphasis, reviewers will look
for evidence that the blueprints reflect the standards that call for higher order or cognitively
complex skills, and that the assessments match the blueprints.

Incomplete or Unacceptable Evidence:

• Evidence that some assessment items measure higher order thinking, but not showing that
most of the standards that call for higher order thinking are adequately assessed

• Using a methodology that does not examine whether the more complex standards are
assessed or whether the assessment tasks parallel the illustrative tasks in the performance
standards

C4. Peer reviewer questions: How clearly has the State identified any gaps or weaknesses and
what is it doing to improve the alignment of its assessment and standards?

Desirable Evidence: A discussion of the gaps found and a description of the strategies that the
State is putting into place to address them such as:

• adding items to the assessment

• adding multiple measures

• adding a writing test

• adopting the longer version of a test

Incomplete or Unacceptable Evidence:

• Conducting alignment studies, even high quality studies, but not describing steps taken or
planned to strengthen the alignment if gaps were found

C5. Peer reviewer questions: If the State system consists of several assessments or draws upon
assessment data from several sources, is there a coherent design that shows how all the
standards are assessed?

Desirable Evidence: Reviewers will look for descriptions of the State's assessment system plan
which describes the ways in which different assessments provide for alignment, and

• how the results from the different assessments are reported, separately or combined (if and
when that is appropriate);

• how the results from the different assessments are to be interpreted by the users;

• how comparability issues are handled (even though this is mainly dealt with under
"technical quality"); and

• the different roles of local and State personnel in selecting and scoring the assessments, and
in interpreting and using the information.

Incomplete or Unacceptable Evidence:

• Simply listing the different assessments without showing how they fit together to form an
assessment system

• Indicating that some of the standards are assigned to the schools for assessment using their
own instruments, without showing how this process leads to valid inferences about the
effectiveness of programs in schools across the State

C6. Peer reviewer questions: How is the alignment of the assessment and the standards
communicated? Is it clear to educators and parents what is being assessed and how it relates to
the standards?

Desirable Evidence: Reviewers will look for ways the State has used various documents such as
manuals, bulletins, reports of results, and website displays to show the alignment and
communicate this information both to educators and the public.

Incomplete or Unacceptable Evidence:

• Indicating or implying that there really is no easy way for teachers or the public to see how
or how well the assessments match the standards



Appendix C provides additional illustrations of types and sources of evidence that a State might 
consider when studying alignment. It is also a good source for the types of evidence that peer 
reviewers might look for within each category of alignment. 







PART II - D: Professional Standards of Technical Quality 

1. Requirement-Legal Citation 



State assessments shall - 

Be used for purposes for which such assessments are valid and reliable, and be consistent 
with relevant, nationally recognized professional and technical standards for such 
assessments. (Section 1111(b)(3)(C)) 



2. Intent and purpose 

The intent of this requirement is to ensure that the assessment data that are used to hold schools 
accountable are indeed technically sound and meaningful. Ensuring technical quality of 
assessments is an ongoing task that will continue as long as assessments are in place. However, 
for the purposes of this review, each State needs to document that its assessments are technically 
adequate and that it has taken reasonable steps to ensure that results are used in a manner that is 
technically sound. 

3. Abbreviated Description 

Although the law mentions only the two most well known technical characteristics, validity and
reliability, a number of additional requirements are considered essential. Other criteria discussed
in this section include fairness/equity; comparability; administration, scoring, analysis and 
reporting processes; and interpretation and use. 2 Peer reviewers will look for evidence of 
technical quality along six dimensions. 

a. Validity - “the appropriateness, meaningfulness, and usefulness of the specific inferences 
made from test scores” (Standards, 1985, p. 9). Peer reviewers will look for evidence that: 

• the State has considered whether the inferences drawn from the assessment are 
appropriate and meaningful; 

• the State has examined construct validity (whether the assessment actually measures the 
content and performance standards in question); and 

• the State has examined consequential validity (the validity as judged by the long-term 
impact of the results). 

b. Reliability - the level of consistency, stability, and accuracy of the assessment. The
Standards explains reliability as follows: "Fundamental to the proper evaluation of a test are the
identification of major sources of measurement error, the size of the errors resulting from these
sources, the indication of the degree of reliability to be expected between pairs of scores under
particular circumstances and the generalizability of results across items, forms, raters,
administrations, and other measurement facets" (1985, p. 19). This is a tall order, a task that is
beyond the present practice of many programs. The focus of the reviewers, therefore, will
include the adequacy of the plans for carrying out this process and of the initial steps taken.

2 We might have listed self-examination and continuous improvement as criteria, since they are essential for any
system, especially one that exists solely to improve other systems. Although not a formal requirement, States need
to take a proactive stance and systematically seek evidence that the system is providing the best possible information
in the best way possible.

c. Fairness/accessibility - ensuring that all students have an equal opportunity to show what
they can do, even though they have different backgrounds, different and complex patterns
of abilities that interact with the assessment process itself, and different opportunities to meet the
standards.

d. Comparability of results -- from year to year, school to school, and student to student. 

Given the demands placed on Title I assessments to detect change, especially from year to year, 
it becomes necessary to consider comparability in designing and developing the assessment, and 
then in gathering confirmatory data during the implementation phase. Although comparability is
difficult to implement and to document, States have an obligation to show they have made a
reasonable effort to attain it, especially where locally selected assessments are part of the
system.

e. Administration, scoring, analysis and reporting procedures. Most states take great pains 
to ensure that the assessments are properly administered, that directions are followed, that test 
security requirements are clearly specified and followed, and that all students are assessed. 
Nevertheless, it is important they document the ways in which they ensure that their system does 
not omit any of these basics. 

f. Interpretation and use - ensuring that users of the assessment data have the support needed 
to draw the most appropriate interpretations and use the results in the most valid ways. 



4. Full description

Only the most commonly agreed-upon principles and criteria related to technical quality are 
presented here. 3 Most of these are discussed in greater detail in two authoritative documents in 
the field, the Standards for Educational and Psychological Testing (1985) and Educational 
Measurement (Linn, 1989). Reference is also made to the draft of the next revision of the 
Standards for Educational and Psychological Testing 4 (in press) which is scheduled for 
publication as soon as the three sponsoring organizations 5 give their approval, which is expected 
later this year. 



3 The reader may notice that the various criteria, especially validity and reliability, refer at times to individual
assessment issues and at times to issues related to the use of group-level summaries. The focus of the peer reviewers
is on both individual and group issues, depending on the particular purpose and use of the assessment information in
question.

4 As this document was being prepared, the authors consulted draft versions of the revised Standards for
Educational and Psychological Testing in an effort to assure consistency with the 1999 Standards.

5 The American Educational Research Association, the National Council on Measurement in Education, and the American
Psychological Association.






a. Validity 

This complex topic is often simplified in textbooks in the form of the quasi-tautological question, 
"Does the test measure what it purports to measure?" This turns out to be a difficult question to 
answer, sometimes leading to considerable controversy. This discussion of validity recognizes 
three relatively recent major conclusions about the definition of this elusive concept. 

• The focus of validity is not really on the test itself, but on the inferences drawn from the
results that it yields.

• All validity is really a form of "construct validity." 

• In validating an assessment, one must also consider the consequences of its interpretation 
and use. 

Drawing Inferences. 

Over the years the focus has been on different types of validity, such as content validity or 
concurrent validity. It is now agreed, however, that validity is a global concept centering on 
the inferences that are drawn from a set of findings by a given user in the light of the purpose 
of the assessment. The Standards for Educational and Psychological Testing underscores this 
definition: 

Validity refers to the appropriateness, meaningfulness, and usefulness of the
specific inferences made from test scores. Test validation is the process of
accumulating evidence to support such inferences. . . . Although evidence may be
accumulated in many ways, validity always refers to the degree to which that
evidence supports the inferences that are made from the scores. The inferences
regarding specific uses of a test are validated, not the test itself. (1985, p. 9)

The draft revision of the Standards for Educational and Psychological Testing only serves to 
underscore this emphasis on the drawing of inferences. It goes on to assert that the various 
types of validity are really types of evidence that can be used to confirm the appropriateness 
of drawing certain types of inferences about student performance on the basis of test scores. It 
recasts the traditional types of validity in terms of types and sources of evidence, all of which 
pertain to construct validity. It speaks of four broad categories of evidence: (1) evidence 
based on the assessment's relation to other variables, (2) evidence based on student response 
processes, (3) evidence based on test content, and (4) evidence from internal structure. 

Construct Validity 

The second major transformation is a natural sequel to the first. That is the realization that
"construct validity" is not just one of many types of validity; it is validity. All validity
evidence and arguments are focused on the basic question, "Is the assessment tapping the
concept, skill or trait in question? Is it really measuring mathematical reasoning or reading
comprehension?" A variety of types of evidence and analyses can be used to answer such a
question, none of which provides a simple yes/no answer.

The reader is reminded of the Venn diagram used in the alignment section. It illustrated the
omission of some aspects of content by the assessment, and the undesirable inclusion of other
aspects of learning that were outside the scope of the standards. In the parlance of construct
validity, the content that should be assessed but is not is known as "construct
underrepresentation," and the topics that are assessed, perhaps inadvertently, even though they
lie outside the standards are known as "construct irrelevant variance." At the level of a content
match, as in alignment, it is relatively easy to identify both types of mismatch and their
magnitude.

Determining whether an assessment is actually measuring what the State intended is a more 
difficult matter. The draft Standards for Educational and Psychological Testing illustrates 
how various threats to construct validity might undermine the meaning of scores on a reading 
comprehension test. If a student has a strong emotional reaction to the reading passage, for 
example, it is easy to see how the results might not be a valid estimate of his/her reading 
ability. Similarly, if the assessment calls for students to write a long response to explain their 
answers, the results for some students might be distorted by their writing ability. 

Distinguishing what is measured from what is not measured often involves the use of 
triangulation— a process of reasoning from diverse sources of evidence, including the four 
mentioned below. 

1) Using evidence based on test content (content validity). It is now widely recognized that 
content validity is one facet of construct validity that appears mainly in the validation of 
achievement tests. In fact, the question is often posed, "Is construct validity really separate 
from content validity?" Messick (1988) answers negatively: 

Typically, content-related inferences are inseparable from construct-related inferences. 

What is judged to be relevant and representative of the domain is not the surface 
content of test scores but the knowledge, skill, or other pertinent attributes measured by 
the items or tasks (1988, p. 38). 

Content validity, that is, alignment of the standards and the assessment, is important but not 
sufficient. States must document not only the surface aspects of validity illustrated by a good 
content match, but also the more substantive aspects of validity that clarify the "real" meaning 
of a score. 

2) Using evidence of the assessment's relationship with other variables. One approach is to 
document the validity of an assessment by confirming its positive relationship with other 
assessments or evidence that are known or assumed to be valid. For example, if students who 
do well on the assessment in question also do well on some trusted assessment or rating, such 
as teachers' judgments, it might be said to be valid. 

It is also useful to gather evidence about what a test does not measure. The Standards for
Educational and Psychological Testing proposes that:

When a test is proposed as a measure of a construct, evidence should be presented to
show that the score is more closely related to that construct when it is measured by
different methods than it is to substantially different constructs (1985, p. 15).






This means, for example, that a test of mathematical reasoning should be more highly 
correlated with another math test, or perhaps with grades in math, than with a test of scientific 
reasoning or a reading comprehension test. The most common (and complicated) example is
found in this very area: teachers are frequently concerned that tests of mathematical reasoning
might actually measure reading comprehension, since students must be able to understand the
problem, which is usually presented in narrative form. Although students obviously need to
be able to read well to understand the math task, the validation challenge is to marshal
evidence that students who do well on the assessment are not relying on their reading ability
to answer the questions, and simultaneously, if possible, to confirm that students who are less
skilled readers are not hindered in demonstrating their mathematical understanding.
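
The comparison described above can be illustrated with a pair of correlations. The sketch below uses invented scores for six students on three hypothetical tests and a hand-written Pearson correlation; it is meant only to show the shape of the evidence (a higher correlation with the same construct than with a different one), not a complete validation study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented scores for six students; none of these are real data.
math_reasoning = [10, 12, 15, 18, 20, 25]   # the test being validated
other_math = [11, 13, 14, 19, 21, 24]       # another measure of the same construct
reading = [20, 14, 22, 15, 21, 17]          # a measure of a different construct

r_same_construct = pearson(math_reasoning, other_math)
r_other_construct = pearson(math_reasoning, reading)

# The pattern the Standards asks for: a closer relation to the same
# construct than to a substantially different one.
print(r_same_construct > r_other_construct)
```

With real data the two correlations are rarely this far apart, and a sizable correlation between a mathematics test and a reading test would prompt the kind of follow-up questions about reading load described above.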

3) Using evidence based on student response processes. The best opportunity for detecting 
and eliminating sources of test invalidity occurs during the test development process. Items 
obviously need to be reviewed for ambiguity, irrelevant clues, and inaccuracy. More direct 
evidence bearing on the meaning of the scores can be gathered during the development 
process by asking students to "think-aloud" and describe the processes they “think” they are 
using as they struggle with the task. Many states now use this "assessment lab" approach to 
validating and refining assessment items and tasks. 

4) Using evidence based on internal structure. A variety of statistical techniques have been 
developed to study the structure of a test. These are used to study both the validity and the 
reliability of an assessment. The well-known technique of item analysis used during test 
development is actually a measure of how well a given item correlates with the other items on 
the test. If an item gets a high index, we say it is a good item, meaning that students who get 
it right also tend to do very well on most of the other items. This practice actually helps 
ensure a focus to the assessment. It means that although a reading comprehension test 
consists of items that measure different aspects of comprehension, there is a core focus that 
helps ensure the reliability of the assessment. Newer technologies including generalizability 
analyses are variations on the theme of item similarity and homogeneity. 
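
A minimal sketch of this kind of item analysis follows. It assumes items scored right (1) or wrong (0) and uses one common form of the index, the correlation between each item and the total on the remaining items; the response data are invented, and operational programs may use other indices.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either list has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(responses):
    """For each item, correlate its scores with the total of the other items."""
    n_items = len(responses[0])
    result = []
    for j in range(n_items):
        item_scores = [row[j] for row in responses]
        rest_totals = [sum(row) - row[j] for row in responses]
        result.append(pearson(item_scores, rest_totals))
    return result


# Invented responses: one row per student, one column per item (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

indices = item_rest_correlations(responses)
weakest_item = min(range(len(indices)), key=lambda j: indices[j])
print(indices, weakest_item)
```

In this example the last item has a negative index: students who answer it correctly tend to do worse on the rest of the test, so it would be flagged for review.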

Other techniques are used to show whether there are certain clusters of items: whether, for
example, the items measuring mathematics computation tend to "hang together" and the items
in concepts and problem solving tend to form a relatively separate cluster. Although the
number of clusters that these statistical methods are able to identify is nearly always smaller
than the number of content categories used in test development, the analysis can still be a useful
exercise as part of the package of construct validation techniques. A combination of several of
these statistical techniques can help to ensure a balanced assessment, avoiding, on the one hand,
the assessment of a narrow range of knowledge and skills that shows very high reliability and,
on the other hand, the assessment of a very wide range of content and skills that triggers a
decrease in the consistency of the results.

Multiple measures. One purpose of multiple measures is to ensure validity in both the
relatively superficial sense of content validity and the deeper aspects of construct validity.
Different types of measures and tasks, including the use of different testing formats, are
needed to assess different content standards and to measure the different types of knowledge
and skill represented by those standards. Multiple measures can also play a role in ensuring
the validity of interpretations of performance for diverse populations.

Consequential Aspects of Validity.

The third major shift in recent thinking is that the evaluation of an assessment must also look 
at the consequences of the assessment, including the application of the results. Messick 
(1989) points out that test interpretation and use are different functions, and that the impact of 
an assessment can be traced either to an interpretation or to how it is used. He also notes that 
if we are trying to see if an assessment is "doing the job" it is quite natural that we look at the 
consequences of the assessment. In fact it is rather amazing, in retrospect, that this is a new 
realization! 

The point is that the functional worth of the testing depends . . . on the
consequences of the outcomes produced, because the values captured in the
outcomes are at least as important as the values unleashed in the goals (Messick,
1989, p. 85).

Furthermore, as in all evaluative endeavors, we must attend not only to the effects, but to the 
side effects. 

Judging validity in terms of whether a test does the job it is employed to
do . . . requires evaluation of the intended and unintended social consequences
of test interpretation and use (Messick, 1989, p. 84).

The array of possible consequences for individual students or groups of students is wide. The 
analysis of consequences is often focused on the unintended or unnoticed consequences of the 
assessment. The disproportionate placement of certain categories of students in special
education is an example of an unintended (and negative) consequence of what had been
considered proper use of instruments that were considered valid. More recently, assessment
has been used as a policy tool to help focus instruction on certain valued outcomes. On the
other hand, if the assessment narrowly focuses on certain types of skills, it can have a negative
impact on instruction and on learning. Messick (1989) chose to focus on this very example of
unintended consequences:

[Consequential aspects of validity] require evaluation of the intended and
unintended social consequences of test interpretation and use. For example, the
use in educational achievement tests of structured response formats such as
multiple-choice (as opposed to constructed responses) might lead to increased
emphasis on memory and analysis in teaching and learning at the expense of
divergent production and synthesis (1989, p. 39).

b. Reliability 

The term “reliability” is usually defined with synonyms such as consistency, stability, and
accuracy. These terms all relate to the problem of uncertainty in making an inference about a
score. As reflected in the Standards for Educational and Psychological Testing, the field now
treats reliability as a study of the many sources of unwanted variation in assessment results. 6
Those responsible for developing and operating State assessment systems are obliged to (1) 
make a reasonable effort to determine the types of error that may (unwittingly) distort 
interpretations of the findings, (2) estimate their magnitude, and (3) make every possible 
effort to alert the users to this lack of certainty. The Standards for Educational and 
Psychological Testing puts it this way: 

Fundamental to the proper evaluation of a test are the identification of major 
sources of measurement error, the size of the errors resulting from these sources, 
the indication of the degree of reliability to be expected between pairs of scores 
under particular circumstances and the generalizability of results across items, 
forms, raters, administrations, and other measurement facets (1985, p. 19). 

This is a tall order, a task that is beyond the present practice of many programs. The focus of 
the reviewers, therefore, will include the adequacy of the plans for and initial steps taken to 
carry out this process. 

The reliability of an assessment, or lack of undesirable variability, is a function of many 
factors. Three of the factors most relevant to State assessment are briefly discussed below. 

Sampling. Assessment is essentially a sampling problem. That is, it is a matter of sampling
from a domain of all the skills that could be assessed; therefore, students need the opportunity 
to take a sufficiently large sample of items or tasks in order to yield a stable estimate of their 
level of performance. The relationships between number of items each student takes and the 
consistency of the scores are well known; for example, the amount of improvement in 
reliability is great when moving from a small to a medium number of items, but after a certain 
number of items, the improvement is relatively trivial. As mentioned under validity, the law 
calls for the use of multiple measures. The implications of multiple measures for reliability 
are obvious; increases in score consistency and stability result from the administration of 
additional exercises, whether they are administered at one time or over a period of time 
(yielding other useful information in the process). Fortunately, when assessment results are
used for school accountability, the problem is substantially reduced, since errors in estimating
individual student performance tend to cancel out at the group level. The use of matrix
sampling increases the stability of the results at the school level still further.
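
The diminishing return from added items can be illustrated with the Spearman-Brown formula from classical test theory, and the cancelling of student-level error with the standard error of a school mean. This guidance does not prescribe either calculation; the sketch below assumes parallel items and independent student errors, and the starting reliability, score scale, and school size are hypothetical.

```python
import math

def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

def school_mean_error(score_sd, reliability, n_students):
    """Measurement-error component of the standard error of a school mean."""
    sem = score_sd * math.sqrt(1 - reliability)   # student-level standard error of measurement
    return sem / math.sqrt(n_students)

base = 0.50  # hypothetical reliability of a short test
# Reliability projected for tests 1, 2, 4, and 8 times as long.
projected = {k: round(spearman_brown(base, k), 3) for k in (1, 2, 4, 8)}
print(projected)

# Hypothetical school of 100 students on a scale with a standard deviation of 15.
school_error = school_mean_error(score_sd=15, reliability=0.80, n_students=100)
print(round(school_error, 2))
```

Under these assumptions, doubling a short test raises reliability far more than doubling an already long one, and the error in a school average is a small fraction of the error in any one student's score.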

Level of challenge. In order to show what they can do, students need to respond to tasks that
are within their range of knowledge and their skill level. If the assessment taps content that
students have not been exposed to, they will not respond or will respond randomly. Similarly,
if the level of the assessment is far below their level of functioning, their scores will be less
accurate, either over- or under-estimating their actual performance.

6 And the magnitude of these errors is often larger than has commonly been reported.

Rater accuracy. A third issue has come to the forefront in recent years with the increasing use
of essay tests and other performance assessments: the degree of agreement of those rating the
results. Even back in 1985, the Standards for Educational and Psychological Testing stressed
the obligation to report the degree of consistency among the raters: 7

Where judgmental processes enter into the scoring of a test, evidence on the
degree of agreement between independent scorings should be provided (1985,
p. 22).
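
Two simple ways of summarizing agreement between independent scorings are the exact-agreement rate and Cohen's kappa, which adjusts for chance agreement. The Standards passage quoted above does not specify a statistic; these are offered as a minimal sketch, and the rubric scores are hypothetical.

```python
from collections import Counter

def exact_agreement(rater_a, rater_b):
    """Share of papers on which two raters gave the same score."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohen_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    observed = exact_agreement(rater_a, rater_b)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal proportions per score.
    expected = sum(count_a[s] * count_b[s] for s in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical scores (rubric of 1 to 4) from two independent readers of ten essays.
rater_a = [1, 2, 2, 3, 3, 3, 4, 4, 2, 1]
rater_b = [1, 2, 3, 3, 3, 2, 4, 4, 2, 2]

agreement = exact_agreement(rater_a, rater_b)
kappa = cohen_kappa(rater_a, rater_b)
print(agreement, round(kappa, 3))
```

As footnote 7 cautions, figures of this kind describe only scoring consistency; they do not capture the other sources of error that affect the reliability of the scores themselves.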

Reporting level of accuracy.

The information or evidence provided by States on the stability of their assessments will 
indicate how well these and other issues have been addressed. The traditional methods of 
portraying the consistency of test results, including reliability coefficients and standard errors 
of measurement, should be augmented by, if not replaced with, techniques that more 
accurately and visibly portray the actual level of accuracy (Rogosa, 1995; Young and Yoon,
1999). Most of these methods focus on error in terms of the probability that a student with a
given score, or pattern of scores, is properly classified at a given performance level, such as 
"proficient." For school-level or district-level results, the report would indicate the estimated 
amount of error associated with the percent of students classified at each performance level. 
For example, if a school reported that 47% of its students were proficient, the report might say 
that the reader could be confident at the 95% level that the school's true percent of students at 
the proficient level is between 33% and 61%. Furthermore, since the focus on results in a 
Title I context is on growth, the report should also indicate the accuracy of the year-to-year 
changes in scores. 
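
The interval in the example above can be approximated with a simple binomial calculation. The sketch below is one way such a figure might be produced, not the method of the authors cited; it assumes a school with 49 tested students (a number chosen only because it roughly reproduces the 33% to 61% range) and reflects only the variability due to the particular group of students tested, not measurement error in each student's classification. The second-year figure of 55% is hypothetical.

```python
import math

Z_95 = 1.96  # normal critical value for a two-sided 95% interval

def proportion_interval(p, n, z=Z_95):
    """Normal-approximation interval for a percent-proficient figure."""
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

def change_interval(p_old, n_old, p_new, n_new, z=Z_95):
    """Interval for a year-to-year change, treating the two cohorts as independent."""
    se = math.sqrt(p_old * (1 - p_old) / n_old + p_new * (1 - p_new) / n_new)
    diff = p_new - p_old
    return diff - z * se, diff + z * se

# 49 tested students is an assumed school size, not a figure from the guidance.
low, high = proportion_interval(0.47, 49)
# 55% proficient in the following year is a hypothetical value.
gain_low, gain_high = change_interval(0.47, 49, 0.55, 49)

print(round(low, 2), round(high, 2))
print(round(gain_low, 2), round(gain_high, 2))
```

Under these assumptions, the interval for the year-to-year change includes zero, so a report could not confidently describe the apparent gain as real growth.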

For reliability, the obligation of the States is two-fold. First, they need to document the 
reliability of the scores at the student level (unless a matrix-sampling design is employed) and 
the school level. Second, they need to show that they are taking all reasonable steps to 
inform, in the most meaningful way possible, the consumers of the student and school reports 
of the level of accuracy of the results. 
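
At the student level, one way a report might convey the level of accuracy is to show a score band and the approximate chance that the student's true score falls at or above a cut score. The sketch below assumes normally distributed measurement error with a constant standard error of measurement and ignores regression toward the mean; the scale, cut score, and standard error are hypothetical, and the approach is a simplification rather than the procedure of any particular State.

```python
import math

def score_band(observed, sem, z=1.0):
    """Band of plus or minus z standard errors of measurement around a score."""
    return observed - z * sem, observed + z * sem

def chance_at_or_above(observed, cut, sem):
    """Approximate chance that the true score is at or above the cut score."""
    z = (observed - cut) / sem
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))   # standard normal CDF

# Hypothetical scale: the cut score for "proficient" is 300 and the SEM is 10 points.
band = score_band(observed=305, sem=10)
chance = chance_at_or_above(observed=305, cut=300, sem=10)
print(band, round(chance, 2))
```

A student reported as "proficient" with a score only half a standard error above the cut would, on these assumptions, have a substantial chance of being misclassified, which is the kind of information the consumers of the reports need.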

c. Fairness/Accessibility 

Fairness could well be considered a facet of validity, since it poses the question, "Is the 
inference that one would draw about a student's performance on this assessment valid, or is 
there something about the assessment or its interpretation which prevents a clear affirmative 
answer?" However, fairness is treated separately in order to help ensure that States do not 
overlook any known trouble spots, and to help them develop an effective plan to identify and 
eliminate them. It is also treated separately in the draft of the revision of the Standards for 
Educational and Psychological Testing. 

7 The real issue, of course, is not the "scoring reliability" but rather the overall "score reliability," which includes
various other types of error variance as well as scorer reliability.

Like validity, fairness has been the subject of considerable research and much has been
written about it. Nevertheless, it suffers from the lack of a single specific definition. For the
purposes of this review document, fairness means that all students have an equal opportunity
to show what they can do, in spite of the fact that they have different backgrounds, different
and complex patterns of abilities that interact with the assessment process itself, and different
opportunities to meet the standards.

The draft version of the Standards for Educational and Psychological Testing identifies 
several types or sources of unfairness: 

• bias or unequal treatment of students in the assessment process or in the processes of 
reporting, interpretation or use; 

• the lack of opportunity to learn to the standards. 

It is especially important that States take steps to ensure fairness for the populations that may 
have been victims of unfair assessment in the past. These populations include the very target
of Title I programs (students from poverty) as well as English language learners and students
with disabilities. Many of the most critical issues involved in ensuring fairness for these latter
groups were treated in Part I under the topic of assessing all students. The strategies for 
assessing these students, including assessing students in their primary language and using 
accommodations and alternate assessments, are still being developed, studied, and refined. 
Admittedly, there is controversy about their use. The use of any of these strategies at this 
point does not produce incontrovertible evidence of fairness, validity, reliability, or 
comparability. Nevertheless, the State must describe the steps it is taking, and its plans for 
making the assessments as fair as possible. (And the solutions will only come as States try 
different approaches and provide detailed information on the results of their efforts.) 

Unfairness most often appears at four points in the assessment process. These four points
might serve as a framework for States to use in attacking the problem, and for reviewers to
use in judging the adequacy of their efforts.

• The items or tasks do not provide an equal opportunity for all students to fully 
demonstrate their knowledge and skills. 

This issue can be addressed through an aligned assessment that provides all students 
in the system the opportunity to demonstrate their proficiency relative to the content 
standards. It allows students who have learned the content in different ways, students 
with disabilities, and students who are English language learners to fully demonstrate 
their knowledge and skills. Assessments should allow for 

- different ways of expressing competency and responding to tasks (i.e., accessibility;
Appendix D provides additional details on this important area),
- the use of accommodations and modifications,
- the screening for bias and irrelevant factors, and
- the empirical study of items and tasks.

• The assessments are not administered in ways that ensure fairness. 

• The results are not reported in ways that ensure fairness. 

• The results are not interpreted or used in ways that lead to equal treatment. 

(Note: These four points are illustrated further in Appendix D, Summary of Technical 
Quality Criteria and Illustrative Evidence) 
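
The empirical study of items is often carried out with a differential item functioning (DIF) analysis. The sketch below uses the Mantel-Haenszel common odds ratio, a widely used technique that this guidance does not itself name; the function names, the group labels, and all counts are hypothetical.

```python
import math

def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across ability strata.

    Each stratum is (ref_right, ref_wrong, focal_right, focal_wrong) for
    students in the reference and focal groups matched on total score.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    return numerator / denominator

def mh_delta(odds_ratio):
    """Odds ratio re-expressed on the delta scale; zero means no difference."""
    return -2.35 * math.log(odds_ratio)

# Hypothetical counts for two items in three total-score strata (low, middle, high).
balanced_item = [(10, 40, 5, 20), (30, 30, 10, 10), (40, 10, 20, 5)]
flagged_item = [(10, 40, 2, 23), (30, 30, 6, 14), (40, 10, 15, 10)]

balanced_ratio = mantel_haenszel_odds_ratio(balanced_item)
flagged_ratio = mantel_haenszel_odds_ratio(flagged_item)
print(round(balanced_ratio, 2), round(mh_delta(flagged_ratio), 2))
```

An odds ratio near 1 (a delta near 0) indicates that matched students in the two groups have similar chances of success on the item; items with larger departures would ordinarily be sent to content and bias reviewers for a closer look rather than dropped automatically.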







Finally, States are reminded of the requirement for the use of multiple measures, which can be
a part of the total solution. Students who may not be able to demonstrate their skills
effectively on one type of assessment may do very well on another.

d. Comparability of Results (not fiscal comparability!) 8

Many uses of State assessment results assume comparability of different types: comparability 
from year to year, from student to student, and from school to school. To some degree this 
can be thought of as a natural part of validity and reliability; in fact, some have referred to it 
as system reliability. Nevertheless, given the demands placed on Title I assessments to detect 
change, especially from year to year, it becomes necessary to consider comparability in 
designing and developing the assessment, and then in gathering confirmatory data during the 
implementation phase. Although comparability is difficult to implement and to document, States
have an obligation to show that they have made a reasonable effort to attain it, especially
where locally-selected assessments are part of the system.
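
Where different forms are used from year to year, comparability usually depends on some statistical linking of the score scales. As a minimal sketch, assuming the two forms were taken by randomly equivalent groups, linear equating matches the means and standard deviations of the two forms; the guidance does not prescribe this or any other method, and the scores below are hypothetical.

```python
from statistics import mean, pstdev

def linear_equate(score_x, form_x_scores, form_y_scores):
    """Place a Form X score on the Form Y scale by matching means and SDs."""
    slope = pstdev(form_y_scores) / pstdev(form_x_scores)
    return mean(form_y_scores) + slope * (score_x - mean(form_x_scores))

# Hypothetical raw scores from two randomly equivalent groups of students.
last_year_form = [10, 30, 50]      # Form Y, the reference scale
this_year_form = [10, 20, 30]      # Form X, on which raw scores run lower

equated = linear_equate(25, this_year_form, last_year_form)
print(round(equated, 1))
```

Without a step of this kind, a change in the difficulty of the form could be mistaken for a change in student performance.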

e. Procedures for test administration, scoring, data analysis, and reporting 

Most States take great pains to ensure that the assessments are properly administered, that 
directions are followed, and that test security requirements are clearly specified and followed. 
Nevertheless, it is important that they document the ways in which they ensure that their system
does not omit any of these basics.

f. Interpretation and use 

Although this topic is closely related to that of validity, and is discussed in most of the other 
topics in this section, it is mentioned here because of its importance. Even if an assessment is 
carefully designed, constructed and implemented, it all can come to naught if users are not 
helped to draw the most appropriate interpretations and to use the results in the most valid 
ways. 

Technical quality and stages of development. Technical quality relates to the three main stages
or phases of the development and implementation of an assessment system: 

• Design and development 

• Initial implementation 

• On-going revision and improvement 

These stages present opportunities both to ensure quality and to document that quality. Some of 
the elements are more related to the initial stages, some to the implementation/maintenance 
phases. A few implications are briefly mentioned below. 

• At the design/development stage, the State has the best opportunity to focus on
validity, reliability and fairness. This is the appropriate time to ensure that the
assessment is aligned, that it is long enough to yield reliable scores, and that the items
give all students a fair opportunity to demonstrate their skills. These are standard
assessment development processes, and States ought to be able to present substantial
evidence of technical quality. Content validity is usually ensured by the development
process, including the method of translating the standards into assessment blueprints
or specifications, involving teachers and content specialists in the process, and
on-going, systematic matching of the assessment items and tasks with the standards.

This is also a way to document alignment. All of the issues discussed under the
alignment section are relevant here, not the least of which is assuring that the
standards which call for higher-order skills and understanding are adequately
assessed.

8 This is not to be confused with requirements of program or fiscal comparability.

• As part of the initial implementation phase, many States conduct studies to verify 
that the design principles actually produced an assessment with the qualities desired. 
For example, States can exploit the larger samples of student data to confirm the 
technical characteristics of the assessment tasks, including the fairness of the tasks for 
different student populations, and to confirm the link between the assessments and the 
performance standards. States should also be able to describe steps to ensure proper 
administration, scoring, analytic procedures, and reporting practices. 

• During the on-going annual administration of the assessment program, if not 
before, States will want to confirm the proper use and interpretation of the results. 
This often leads to, and is done as part of, a statewide staff development effort 
focused on the use of the results to help teachers better identify and strengthen 
weaknesses in the instructional program. 

Construct validation efforts continue throughout the life of the assessment. Evidence 
should continually be sought that the results truly reflect the goals of instruction, 
especially those related to higher-order thinking and understanding. In fact, with the 
spotlight of accountability on assessment results, it is all the more important to be sure 
that the assessment (which might not have changed at all) is still assessing the same
skills. Under the pressures of accountability, steps taken to improve scores can change 
the natural relationships between instruction and assessment. Assessment items that 
ordinarily tap higher-order thinking skills might actually reflect more rote skills if certain 
types of test-preparation efforts are used. The unfortunate side-effects of accountability 
also make it advisable for States to document more basic aspects of quality, including the 
fact that students have not been deliberately taught the actual assessment tasks or clones 
of the items that would spuriously improve results. 

Finally, it is obviously not possible to study the consequences of an assessment until it 
has been implemented for a year or more. One approach to this validation effort might be 
to pose a number of questions, then search for links to the assessment results. For 
example: Are more students meeting the standards because the results led to the creation 
of a dynamic statewide after-school program? Are more students being retained in grade 
as a result of the assessment results? Are more teachers part of a long-term professional 
development program that improves the teaching of reading to low-achieving students? 







5. Preparing a State submission of evidence 



Evidence to support the existence of quality for each of the six characteristics of technical quality 
may take many forms, including requests for proposals; technical manuals; instructions and 
materials associated with the assessments and the reports; professional development descriptions 
and materials; and other descriptive materials. Appendix D outlines some illustrative types and 
sources of evidence that peer reviewers might look for under each category of technical quality. 

States are expected to present a persuasive body of evidence to support the quality of their 
assessments, including evidence about the quality of the assessment instruments themselves and 
evidence about how they are used. Although the States are expected to have some validation 
evidence in hand, in reality, validation requires the accumulation of information from many 
sources over time. Most States will be judged on the basis of the quality and thoughtfulness of 
their long-range plans for obtaining evidence showing that (1) the assessment instruments do in 
fact assess the intent of their standards, that (2) assessment information is interpreted and used 
properly, and that (3) unintended negative consequences are minimized. This scenario is 
consistent with the nature of technical quality: it is not a simple "have-not have" issue, but a
process of continuous improvement and successive documentation over the years. 

6. Questions for peer reviewers: 



Peer reviewer questions 


Desirable Evidence 


Incomplete or Unacceptable 
Evidence 


D1. How has the State considered the issue of validity (in addition to the alignment of the
assessment with the content standards) and taken steps to ascertain that the assessments are
measuring the knowledge and skills described in the standards, and that the interpretations
are appropriate? Has the State specified the purposes for the assessments, delineating the
types of uses and decisions most appropriate to each?

Desirable evidence: Peer reviewers will look for evidence of construct validity, consequential
validity, and evidence that State and local users draw valid inferences from the assessments.
They will want to see that the State took care in developing the assessment (meaning it
conducted field tests and various types of research efforts) to be sure that the items and tasks
actually tapped the essence of the standards, and that it did so for students of diverse
backgrounds. In addition, they will want to see that the State has a systematic plan for
conducting on-going validation studies to see if the results should be trusted. For example, it
may want to compare the assessment results with other assessment information and/or with
the quality of work that students are actually producing in class. Moreover, validation studies
should document the impact of the assessments. For example, they may want to see if the
assessments have had a positive impact on classroom practice, e.g., whether students are doing
more writing and thought-provoking project work, or whether they are spending an inordinate
amount of time in lower-level test prep activities.

Incomplete or unacceptable evidence: The State conducted an alignment study, but has no
plans for studying the assessment to see if it actually assesses what it claims to, or for
identifying types of students or schools where the results are not valid because, for one reason
or another, the assessment does not function as it was designed.




D2. How comprehensively has the State determined that its assessments provide consistent
and reliable results for individual students, schools, and LEAs? Does the State include
information in its reports about the level of reliability of its scores?

Desirable evidence: Peer reviewers will look for evidence from the design of the test, analyses
of test and scoring data, procedures for ensuring rater reliability, steps taken to ensure
reliability of school-level scores, and communications and training opportunities for schools
and the public to understand the level of reliability of the assessment. For example, one State
required in its request for proposals for developing the assessment that the bidders provide
evidence on the relative advantages and disadvantages of various test lengths and
configurations, given the purposes for the different components of its assessment system. This
information was examined by State staff and its technical advisory committee before final
decisions were made.

Incomplete or unacceptable evidence: The State uses a short version of a standardized test not
only for school-level assessment purposes but also for making important decisions about
student promotion and placement.


D3. What steps has the State taken to ensure the fairness and accessibility of the
assessments?

Desirable evidence: Peer reviewers will look for evidence that the State has taken steps to
ensure fairness in the development of items and tasks, including the conduct of bias studies; in
the administration of the test; and in the reporting of results. They will expect to see how
accommodations and alternate assessments are used to help students respond to tasks in a
meaningful fashion, as well as statewide figures on the numbers of students who used different
accommodations and the achievement results for each group. States are expected to
demonstrate that they assess a high percentage of all students, that they have a solid plan for
increasing that percentage, and that LEAs have an incentive for assessing as many students as
possible.

Incomplete or unacceptable evidence: A large number of students are not assessed and the
State has no clear plans for increasing the proportion assessed, or has no program for
confirming that the results are valid for those students who are assessed.




D4. How are multiple measures used to meet the criteria of validity, reliability, and fairness?

Desirable evidence: The State developed a matrix of the ways in which multiple measures
might enhance the technical qualities of the assessments; this became a template that guided
the initial design of the program and the assessment blueprint. The State can show how
different measures and the use of different formats and strategies are used to increase the
validity, reliability and fairness for each of its assessments for each of its population groups.

Incomplete or unacceptable evidence: The State uses a single norm-referenced test and counts
some items in more than one domain to accomplish coverage of all of the standards.


D5. In what way does the State ensure that the assessment results are comparable for different
schools and for different years?

Desirable evidence: Peer reviewers will look for evidence of year-to-year consistency in
development, administration, scoring, and analysis procedures, as well as evidence that the
item content and focus and level of challenge are maintained from year to year, including the
use of statistical procedures to link scores on different forms of the tests.




D6. What evidence does the State have that its administration, scoring, analysis, and reporting
procedures consistently meet high technical standards?

Desirable evidence: The State developed a set of criteria or standards for each of these
components. It requires its contractors to provide specific information on the degree to which
each criterion is met. This information is then reviewed by the State staff and appropriate
advisory committees.

Incomplete or unacceptable evidence: There are no procedures for ensuring that teachers or
students do not have inappropriate access to the assessments. There are no procedures for
ensuring that the scoring of open-ended tasks meets industry standards for accuracy.


D7. What actions has the State taken to ensure that teachers, other educators, and parents
properly interpret and use the results? How does the State help them take into account the
accuracy of the results when making interpretations?

Desirable evidence: The reports themselves contain considerable information and use graphics
to aid proper understanding. The State routinely prepares and distributes brochures and
manuals specifically designed for different audiences to help them interpret the results. These
documents contain a variety of scenarios that illustrate different problems and issues. The
State also conducts annual workshops for school personnel focusing on the different reports
and how they can be used.

Incomplete or unacceptable evidence: The results are distributed with a minimum of
supplementary information, including, for example, only a very brief definition of each of the
figures in the report.




D8. What steps is the State taking to periodically review and improve its assessments?

Desirable evidence: The State has several advisory committees that monitor different aspects
of the assessment system and review the results of periodic studies of problem areas. The
State’s assessment statutes call for an objective evaluation every five years, and the State's
assessment budget specifically provides for the conduct of evaluations and research studies,
and for the ongoing upgrading of the assessments.

Incomplete or unacceptable evidence: The State has no plan or procedure for improving the
assessment.







PART III. Reporting and Using Assessment Results in Accountability

The last two sets of requirements pertain to the reporting of results of the assessment and the use 
of the results for determining adequate yearly progress of the schools. They are: 

Reporting requirements: 

E. Providing individual student reports 

F. Providing disaggregated group reports 

G. Development and dissemination of school performance profiles 

Using assessment information for accountability purposes: 

H. Ensuring that State assessment is the primary basis for determining adequate 
yearly progress (AYP) 

I. Including students who have attended the same school for a full academic year 

Part III - E: Providing Individual Reports 

1. Requirement-Legal Citation: 



State assessments shall - 

Provide individual student interpretive and descriptive reports, which shall include scores, 
or other information on the attainment of student performance standards. 

(Sec. 1111(b)(3)(H)) 



2. Intent and Purpose 

The intent of this requirement is to ensure that some level of individual student reporting is 
available as part of the assessment so that students, teachers, and parents have access to 
information about individual student performance. Learning is essentially an individual matter; 
improving performance without feedback is inefficient at best, and hopeless at worst. 

3. Description 

The statutory requirement does not specify how extensive or detailed individual student reports 
need to be. However, it is important that individual student data be reported in relation to the 
State’s content and performance standards. 

Some State assessments are using matrix sampling procedures that are not designed to provide 
complete data on each student since no student takes the entire exam. Such States must provide 
some level of student reporting either on the portions of the test taken, or from other sources of 
information that relate to the State's standards. 
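
The way matrix sampling limits what can be reported for an individual student can be seen in a small sketch. The item pool, block structure, and student identifiers below are hypothetical, and operational designs are considerably more elaborate.

```python
def assign_blocks(student_ids, item_blocks):
    """Rotate item blocks across students so each student takes one block."""
    return {sid: item_blocks[i % len(item_blocks)]
            for i, sid in enumerate(student_ids)}

# Hypothetical 12-item pool split into three blocks of four items.
item_blocks = [list(range(1, 5)), list(range(5, 9)), list(range(9, 13))]
students = [f"S{n:02d}" for n in range(1, 10)]     # nine hypothetical students

assignment = assign_blocks(students, item_blocks)

# Items covered by the school as a whole, and how many students took each one.
items_seen_by_school = sorted({item for block in assignment.values() for item in block})
takers_per_item = {item: sum(item in block for block in assignment.values())
                   for item in items_seen_by_school}

print(assignment["S01"], len(items_seen_by_school))
```

Each student here responds to only four of the twelve items, so an individual report can speak only to that portion of the test, while the school as a whole is measured on every item.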






4. Preparing a State submission of evidence 



States should provide examples of student reports, descriptions of the types of information that 
the reports include, the sources of the data on the reports, the general ways in which the results 
are presented, the frequency and timeliness of the reports, the ways in which various types of 
reports are used to inform parents of their children's progress, how the reports are used by school 
personnel to improve programs, and how all users are trained to properly interpret the findings. 

5. Peer Reviewer Questions 



Peer reviewer questions 


Desirable Evidence 


Incomplete or Unacceptable 
Evidence 


E1. How does the State provide individual student reports? What is the source of the data?

Desirable evidence: The State provides individual information from the State assessment, or it
requires that LEAs report the results of other assessments.

Incomplete or unacceptable evidence: The State provides student reports with course grades
that do not relate that information to the State’s content and performance standards.


E2. What is contained in the student reports? How are the data presented? Are the results
based on the State's content and performance standards?

Desirable evidence: The reports indicate how well each student has performed relative to the
content and performance standards, using both narrative and graphic modes.

Incomplete or unacceptable evidence: Student reports are based on a matrix-sampling design
that provides information on some parts of some standards, with no provision for reporting on
the other standards. Reports only give composite national percentile ranks without linking the
results to the standards.


E3. How does the State ensure the quality of these reports?

Desirable evidence: The State monitors the quality of all contractor-produced reports using
State assessment information, and/or annually monitors the quality of LEA-produced reports
against criteria that have been developed and disseminated.




E4. How are the results disseminated and communicated? Are they clear and understandable?

Desirable evidence: A description of strategies to ensure that individual reports go to all
parents in understandable ways, that parents can see how their children do in relation to the
standards, and that the reports show how much students have progressed since the last
assessment.




E5. How is the State supporting the appropriate interpretation and use of the student level
reports?

Desirable evidence: The State produces interpretive guidelines and manuals. The State
conducts training for local personnel in ways to improve usefulness of individual reports. The
reports describe the amount of error that is associated with each score.

Incomplete or unacceptable evidence: Reports imply that the results are without error.






